1 Editorial


Katharina Morik 
Jörg Rahnenführer
Christian Wietfeld 


This is the third book of a series of books dedicated to the results of the DFG Collabora- 
tive Research Center 876 on “Machine Learning under Resource Constraints”. The first 
book of the series discusses fundamental innovations in the theory and algorithms of 
machine learning. The second book covers the use of machine learning in physics. This 
volume focuses on applications of the machine learning approaches presented in Book 
1. The main idea is to demonstrate with specific examples how machine learning has be- 
come essential as well as practical in solving real-life problems from diverse application 
areas, ranging from medicine and robotics to road traffic and communication networks. 
Various real-life example applications show the significant impact of using tailored 
machine learning methods to improve the performance of the respective processes 
and systems. A key boundary condition imposed by the real-life environments is that 
resources, such as energy, storage, computing power, computing time, etc., are often 
limited and that the practicability of the proposed machine learning solutions depends 
on the efficient use of those resources. Therefore, the success of the solutions discussed 
in this book must not only be measured in terms of performance gains but, at the same 
time, in terms of their resource efficiency and corresponding sustainability. For many 
domain experts, the sheer multitude of machine learning approaches makes it difficult 
to choose the “right” ones for a particular problem. While the availability of software
tools lowers the barrier for non-experts to use machine learning methods, the
application examples contained in this book demonstrate that truly significant impacts 
can often only be achieved by an interdisciplinary combination of domain knowledge 
and the appropriate usage of machine learning methods. Accordingly, this book aims 
to promote proficiency in the use of machine learning methods beyond the quick wins 
of arbitrarily using whatever approach happens to be in fashion. The applications de- 
scribed in this book will touch upon a multitude of machine learning options covering 
the complete process chain from data acquisition, feature extraction, model selection 
via various learning approaches to model verification and model validation. It will show, 
for example, that while the deep learning approaches popular today can be beneficial 
for many problems in some areas, alternative methods such as ensemble learning with 
random forests are more accurate with much less resource utilization in other areas. 

The first part of the book addresses the application area of health and medicine. After
an overview of machine learning in medicine provided by the invited authors Catherine 
Jutzeler and Karsten Borgwardt, a number of results from the CRC 876 are presented, 
covering virus detection, protein analysis, and cancer diagnostics and therapy. The 
second part of the book is dedicated to the application of machine learning for industry
use cases. On the one hand, machine learning enables proactive quality assessments; 
on the other, its application for precise localization, energy harvesting, and swarm 
control demonstrates the potential of embedding machine learning methods in almost 
any element of the manufacturing and logistics process of present and future industry 
environments. In the third part of the book, various examples illustrate the significant 
potential of machine learning for smart city and traffic use cases, such as the prediction 
of traffic flows, the privacy-preserving detection of vehicle flows, and resource-efficient 
crowd sensing in smart cities. 

The fourth part of the book is about improving the performance of communication 
networks through machine learning. This includes new approaches for highly resource- 
efficient vehicle-to-cloud communications as well as machine learning-enabled mobile 
data network analytics and proper dimensioning of 5G network slices. As many applications
of machine learning involve personal data and may raise privacy concerns,
this book also includes a chapter on a general methodology to classify and handle 
privacy aspects of data management as part of the machine learning process chain. 
This focus on the data handling complements the privacy-preserving machine learning 
techniques. With this broad spectrum of application and practical implementation ex- 
amples, we hope that this book will serve domain experts from diverse application areas 
as inspiration for the use of machine learning for their application-specific problems. 
To maximize the impact, many of the presented solutions are provided as open source
and published together with open data sets, allowing for reproducibility and sustainable
transfer. At the same time, machine learning experts are expected to be motivated by 
the impressive impact of their work on real-life problems to further expand the machine 
learning solution space in terms of accuracy and resource efficiency. 


Dortmund, 14.10.2022 
Katharina Morik, Jörg Rahnenführer and Christian Wietfeld


2 Health / Medicine 


2.1 Machine Learning in Medicine 


Catherine Jutzeler 
Karsten Borgwardt 


Abstract: The combination of machine learning and population-scale health data holds 
the potential to revolutionize disease diagnosis and prognosis as well as to enable 
personalized predictions of therapy responses. The foundation for this unique op- 
portunity is the ever-increasing amount of complex high-dimensional health data, 
from the molecular level of genome sequences to the level of image phenotypes and 
health history, that is available in digital form and at high resolution for cohorts of 
hundreds of thousands—and soon millions—of individuals. Machine learning promises 
to be a key technology in the generation of new knowledge from this big health data, 
by detecting new statistical dependencies in large and multisource medical datasets. 
These new data-driven insights may drastically improve our abilities to predict disease 
onset early, define sub-types of diseases, and model disease progression and patient 
heterogeneity at an unprecedented level of detail, thereby supporting clinical decisions. 
Nevertheless, the practical implementation of the vision of machine learning-guided 
precision medicine faces considerable clinical challenges that have to be addressed 
in the future. In this contribution, we will describe the envisioned role of machine 
learning in the context of healthcare, critically discuss the challenges faced in terms of 
the implementation in the clinical routine, and outline future directions of this growing 
field. 


2.1.1 Introduction: The Envisioned Role of Machine Learning in Precision Medicine 


Precision medicine envisions that medical diagnosis, prognosis, and interventions can 
be tailored to the clinical, molecular, and genetic signature of individual patients [278]. 
One promising path towards precision medicine is to exploit the explosion of health 
data with modern computational approaches, in particular machine learning. Machine 
learning can be leveraged to detect hidden signals in digital health data (e.g., risk 
factors), uncover patterns or associations with certain diseases, and evaluate the out- 
come of treatments or interventions. Applications of machine learning have also been 
proposed to facilitate early disease recognition, refine diagnosis and prognosis, sup- 
port therapy decisions, and improve biomedical data management. An ever-increasing 
amount of data, from the molecular level of genome sequences to the level of image 
phenotypes and health history, is available for rapidly growing cohorts of individuals.
A prime example is the UK Biobank [658], which makes health data (genetic, molecular, 
imaging, clinical data) of more than half a million healthy people and patients available 
to the global research community. Exploring population-scale health data presents 
enormous opportunities for understanding disease mechanisms, ameliorating therapy 
outcomes, and ultimately improving healthcare. Machine learning and artificial intelli- 
gence offer the methods to mine and analyse the vast amounts of high-dimensional 
digital health data. For instance, designing computational models of diseases opens 
new opportunities to refine our understanding of diseases and their subtypes, discover 
novel biomarkers for early disease detection, and guide clinical decisions. A crucial step 
towards realizing the vision of precision and, eventually, personalized medicine will be
the ability of machine learning algorithms to simultaneously compute and consider a
multitude of patient characteristics. This task is complicated by the fact that current
machine learning applications are often restricted by (1) a lack of patient data, let alone
patient data with temporally resolved clinical phenotypes; (2) massive missingness
in longitudinal, multi-modal patient data; and (3) the enormous effort required to combine
data from different hospitals, which poses legal, information technology (IT), and data
engineering challenges.

The remainder of this contribution will provide an overview of machine learning 
applications in the field of medicine, with a special emphasis on early disease recogni- 
tion, diagnosis, prognosis, and therapy decisions. Moreover, we will critically discuss 
the challenges of machine learning-guided applications in the context of medicine and 
health care in general. Lastly, we conclude with an outlook on what the future may hold
for machine learning-driven applications in the different areas of health care, such as 
diagnosis, prognosis, and therapy development. 


2.1.2 Overview of Common Topics in Machine Learning in Medicine 


The notion of advancing medicine by means of computation is almost as old as digital 
computers [438]. When deployed into the clinical routine, machine learning-guided 
approaches can facilitate early disease recognition, refine diagnosis and prognosis, 
support therapy decisions, and improve biomedical data management [161]. In this 
section, we discuss studies that illustrate the potential role of machine learning in
tackling these tasks.


2.1.3 Automation of Diagnoses and Treatment


Depending on the disease type, time is often a limiting factor for the diagnosis and
initiation of effective treatment. This is particularly true for patients facing serious
medical conditions (e.g., cardiovascular complications, sepsis, cancer), which require
immediate attention and timely clinical decisions. A delay in the diagnostic work-up
puts the patient at risk because the medical condition can get worse the longer it
remains undiagnosed and untreated. To accelerate the diagnostic work-up, the medical
field has increasingly used automation in diagnostics, surgical planning, and therapy 
selection (Table 2.1). Nowadays, machine learning models play an important role in 
the development and implementation of automation processes in the clinical routine. 
Large clinical datasets provide ample amounts of ’raw’ data from which machine 
learning algorithms can derive clinically relevant insights that can inform the diagnosis 
or treatment selection of a patient. Specific examples, which will be discussed in 
detail, are the timely identification of circulatory failure [293], automatic antimicrobial 
resistance prediction [722], cancer tumor recognition in radiology images [28, 123], and 
automation of treatment planning in oncology [713]. 

A prominent example of automated diagnosis is circulatory failure, which occurs
when the arterial pressure and capillary stream are reduced for a prolonged period 
of time [89]. Subsequently, the functions of supplied organs are impaired or in the 
worst case even lost. As circulatory failure is common in critically ill patients, moni- 
toring of circulatory function is an indispensable aspect of the patient management 
in the Intensive Care Unit (ICU). Short-term effects of circulatory failure are usually 
reversible, whereas repeated or extended episodes of low arterial pressure adversely 
affect outcomes and worsen the prognosis [184, 537]. Therefore, the early recognition 
of circulatory failure is of highest priority. Conventional alarm systems to detect circulatory
failure do not utilize comprehensive patient information, which often leads to
alarms that are non-specific or false [553, 619]. False or non-specific alarms can trigger
“alarm fatigue” among intensive care practitioners [97]. In the ICU, large quantities of
measurements from multiple monitoring systems are generated that carry clinically relevant
information. While these data are too complex for a human to analyze, machine learning
applications thrive in data-rich environments such as the ICU. Leveraging clinical and
ICU monitoring data, Hyland and colleagues show that a machine learning algorithm 
based on an array of demographic, physiological, and clinical information is able to 
predict the circulatory failure of ICU patients several hours prior to its onset [293]. Their 
early-warning system outperforms current conventional threshold-based systems and 
has a significantly lower false-alarm rate. To learn to detect deterioration events
that are indicative of circulatory failure from monitoring data, different state-of-the-art
supervised machine learning techniques were employed, including Light Gradient
Boosting Machine (LightGBM) [315], Logistic Regression [283], and a Long Short-Term
Memory (LSTM) based recurrent neural network model [277]. When implemented in the
clinic, such machine learning-guided multi-modal early-warning systems are a first step
towards (semi-)automation of the identification of patients at risk for the development
of circulatory failure in the ICU, while avoiding “alarm fatigue” among the clinical staff.
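
To make the notion of a false-alarm rate concrete, the following minimal Python sketch compares two alarm systems on the same sequence of monitoring windows. The data are made up for illustration, the names `alarm_metrics`, `threshold_alarm`, and `model_alarm` are hypothetical, and the two definitions used here (the fraction of raised alarms that are false, and the fraction of true events that trigger an alarm) are assumptions of this sketch; the evaluation protocol of the cited study [293] may differ.

```python
def alarm_metrics(alarms, events):
    """Compare binary alarm flags with binary event labels, one pair per time window."""
    if len(alarms) != len(events):
        raise ValueError("alarms and events must have the same length")
    pairs = list(zip(alarms, events))
    true_alarms = sum(1 for a, e in pairs if a and e)
    false_alarms = sum(1 for a, e in pairs if a and not e)
    missed_events = sum(1 for a, e in pairs if not a and e)
    n_alarms = true_alarms + false_alarms
    n_events = true_alarms + missed_events
    return {
        # Share of all raised alarms that did not correspond to an event.
        "false_alarm_fraction": false_alarms / n_alarms if n_alarms else 0.0,
        # Share of all true events for which an alarm was raised.
        "recall": true_alarms / n_events if n_events else 0.0,
    }


# Toy, invented data: 1 marks a window with an event or a raised alarm.
events          = [0, 0, 1, 1, 0, 0, 1, 0]
threshold_alarm = [1, 1, 1, 0, 1, 0, 1, 0]  # hypothetical single-threshold system
model_alarm     = [0, 0, 1, 1, 1, 0, 1, 0]  # hypothetical multi-variable model

print("threshold:", alarm_metrics(threshold_alarm, events))
print("model:    ", alarm_metrics(model_alarm, events))
```

On this toy data, the model-based system raises one false alarm out of four while catching all three events, whereas the threshold-based system raises three false alarms out of five and misses one event.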

Tab. 2.1: Selected examples of proposed machine learning approaches that could guide the automation of early detection, disease diagnosis, and treatment planning.

Disease | Automated task | Input data | Analytical methods used
Antibiotic resistance [722] | Diagnostic and treatment support | MALDI-TOF mass spectra and antimicrobial resistance profiles, demographics | Logistic Regression, Light Gradient Boosting Machine, Multilayer Perceptron Deep Neural Network
Circulatory failure [293] | Early detection | Physiological parameters, blood values, vitals, monitoring data, demographics | Light Gradient Boosting Machine, Logistic Regression, and Long Short-Term Memory based Recurrent Neural Network model
COVID-19 [487] | Detection | X-ray images, demographics, clinical data | Deep Learning, Convolutional Neural Network
Diabetic retinopathy [364] | Early detection | Fundoscope images, diabetic retinopathy images | Deep Learning, Convolutional Neural Network
Hyperlipidemia [760] | Diagnosis | Blood parameters, urine parameters, biochemical test parameters, blood sugar parameters, and glycosylated hemoglobin parameters | 1-D Convolutional Neural Network
Laparoscopic robotic surgery [35] | Surgical path planning | Gall bladder images | Reinforcement Learning
Multiple sclerosis [659] | Detection | Brain MRI images, clinical data | Deep Learning, Convolutional Neural Network
Plastic and reconstructive surgery [329] | Diagnosis and surgical planning | 3D face surface scans, demographics | Linear Regression, Ridge Regression, Least-Angle Regression, and Least Absolute Shrinkage and Selection Operator Regression, Support Vector Machine
Prostate cancer [569] | Therapy selection | Prostate cancer images, clinical data | Deep Learning Neural Networks
Prostate cancer [464] | Therapy selection, dose optimization | Prostate cancer computed tomography images, clinical data | Generative Adversarial Network

In addition to the early recognition of circulatory failure or other serious conditions
(e.g., sepsis), a major challenge faced by intensive care practitioners is the escalating
rate of antibiotic resistance in ICU patients. The administration of antibiotics is up to
ten times higher in ICU patients compared with non-ICU patients [542]. Moreover, the
proportion of antimicrobial resistant isolates was found to be considerably higher in ICU 
patients than in patients on general medical floors [27]. Antibiotic resistance substan- 
tially adds to the morbidity, mortality, and healthcare cost related to infections in the 
ICU [118]. Timely initiation of effective antimicrobial treatment has emerged as a critical 
determinant of outcome in patients with bacterial infections. The selection of optimal 
treatment warrants an exact characterization of the underlying pathogen, including the 
determination of resistance profiles. As time is of the essence, streamlining antimicrobial
resistance profiling is imperative. Matrix-Assisted Laser Desorption/Ionization
Time-of-Flight (MALDI-TOF) mass spectrometry (MS) has become the standard rapid 
technology for the identification of microbial species, at least in specialised centers. 
Multiple studies have suggested that machine learning algorithms could be used to 
thoroughly exploit the information contained in MALDI-TOF MS spectra in order to 
expedite species identification and antimicrobial resistance determination [723]. However,
there is a lack of comprehensive information on marker mass for all existing
pathogens and antibiotics, impeding the interpretation of MALDI-TOF spectra. In a
seminal study, Weis and colleagues used machine learning to harness the full potential
of MALDI-TOF MS of microbial isolates to predict antimicrobial resistance [722]. Both of
the implemented machine learning algorithms, Light Gradient Boosting Machine and a
multilayer perceptron deep neural network, could reliably identify antimicrobial resistance
of clinically important pathogens, including Staphylococcus aureus (S. aureus),
Escherichia coli (E. coli), and Klebsiella pneumoniae (K. pneumoniae). Moreover, high
predictive performance was observed for individual species-antibiotic combinations,
such as ceftriaxone resistance in E. coli and K. pneumoniae and oxacillin resistance
in S. aureus [722]. A retrospective clinical trial further demonstrated the clinical benefit
of machine learning-guided antibiotic resistance profiling. Based on the results
provided by machine learning algorithms, the empiric antibiotic regimes would have 
changed for a subset of patients (= 15 %). Importantly, such a change would have been 
beneficial for the majority of patients. This study constitutes the first step of automatic 
phenotype determination, which promises to accelerate the diagnostic work-up and
guide treatment selection. Consequently, this will reduce the time from diagnosis to 
initiation of antibiotic treatment for ICU patients. 

Another medical condition that will likely benefit from machine learning-guided
automated diagnosis is lung cancer, which is the world’s leading cause of cancer-related
deaths [45]. Despite recent advances in the treatment of lung cancer, the overall
5-year survival is still low for advanced stages with distant metastases [616]. Initial 
symptoms of lung cancer tend to be unspecific (e.g., coughing, fatigue) and thus, are 
easy to dismiss as inconsequential. Consequently, the majority of patients present at 
an advanced disease stage when curative treatment is out of reach. Early diagnosis 
of lung cancer is thus imperative to increase the likelihood of survival and treatment 
success [45]. A successful strategy to significantly reduce the mortality is the regular 
screening of at-risk populations [338]. Yet, up to one third of lung nodules are missed 
at the initial screening, likely owing to the low sensitivity and specificity of current
screening methods as well as the limits of human vision. Imaging (e.g., X-ray, computed
tomography [CT], or positron emission tomography-computed tomography [PET-CT])
is an integral part of the diagnostic work-up for lung cancer. Evaluation of the
images is based on a number of imaging attributes, including the nodule size, density, 
and growth [511]. The detection of informative imaging features is a prime example 
of an area in which machine learning and artificial intelligence can excel and be of 
great value to the clinicians in terms of precision and time effectiveness. In particular, 
approaches based on deep learning [236], a branch of artificial intelligence, are an 
intriguing option for automating the complex image analysis to detect subtle alterations 
that specialists might overlook. In a seminal study, Ardila and colleagues developed 
three-dimensional Convolutional Neural Network (CNN) models that perform end-to-end
analysis of CT images of pathology-confirmed lung cancer [28]. Importantly,
the model learns the features of interest, as opposed to previous models that use
hand-engineered features. Learned features have been repeatedly shown to be superior
to hand-engineered features [415, 554]. The developed model was demonstrated to
generate highly accurate patient-level malignancy risk predictions, which gives it
considerable potential clinical relevance. If clinically validated, the results of this study may
represent a step toward automated image evaluation for malignancy risk estimation by
means of deep learning. Importantly, though deep learning systems might outperform
human specialists on some diagnostic tasks, they will not replace the radiologists but 
provide diagnostic guidance. When making a clinical decision, clinicians take into 
account a variety of factors that are not necessarily captured in the input data used by 
the machine learning model. 

Along with early disease detection and phenotype detection, machine learning 
based automation has been demonstrated to be useful in the context of treatment 
planning. One striking example is Automated Therapy Planning (ATP) in patients with 
cancer who require radiotherapy [713]. The treatment success is highly dependent on the
quality of the treatment plan. Inverse planning, a trial-and-error iterative process [512],
is conventionally used to tailor radiotherapy treatment to the individual patient, a
strategy that is strenuous and burdensome for patients. Machine learning and deep
learning have gained momentum in the field as they are thought to improve the quality
and efficiency of radiotherapy treatment planning. Specifically, the learning capability
of these techniques enables oncologists to tailor the treatment plans to individual patients
based on patient-specific anatomical features and by incorporating knowledge
from optimization methods or physicians’ behaviors. A variety of machine learning and
deep learning techniques, from Artificial Neural Networks (ANNs) [716] and Convolutional
Neural Networks (CNNs) [480] to Generative Adversarial Networks (GANs) [236], have
been explored and incorporated in the different stages of radiotherapy treatment planning
[44, 395, 570, 603]. While these methods promise to refine the therapy planning of
various types of cancer, there are some issues relating to patient safety as well as legal
and ethical responsibilities that have to be considered before deep learning-based ATP
can be implemented in the clinical routine.


2.1.4 Biomarker Discovery 


Biomarker discovery, the search for measurable and reproducible indicators of specific
clinical states or of responses to an intervention, has been a major research avenue in
recent years. In addition to refining disease diagnosis and prognosis [103,
139, 286], biomarkers of all sorts (e.g., molecular, digital, imaging) are instrumental
in discovering and defining therapeutic targets [211, 601]. Data from various sources,
including electronic health records, patient monitoring, and imaging, can be leveraged
for biomarker discovery. With the emergence of affordable and time-efficient
high-throughput molecular and gene expression profiling technologies (e.g., DNA microarrays
and RNA sequencing) [282], the search space for biomarkers has reached
unprecedented dimensions and complexity. The challenging nature of these datasets
(e.g., high dimensionality with a large number of noisy features and low sample size)
requires suitable computational models that thrive in these complex, data-rich environments.
In light of that, a variety of machine learning-guided biomarker discovery
strategies have gained great popularity across different fields of medicine [112, 362, 456, 
667] (Table 2.2). 

Tab. 2.2: Selected examples of machine learning applications in the context of biomarker discovery.

Disease | Use of biomarker | Input data | Analytical methods used
Alzheimer’s disease [456] | Disease progression | MRI brain images, clinical data | Logistic Regression, Support Vector Machine
Autism [173] | Disease diagnosis | Functional connectivity, structural connectivity, behavioral data, and brain activation measures | Recursive Cluster Elimination based Support Vector Machine
COVID-19 [428] | Mortality prediction | Laboratory values, demographics, medical history | Support Vector Machine
COVID-19 [590] | Mortality prediction | Laboratory data, clinical data, demographics | Cox Proportional Hazard Model
Dermatitis [211] | Disease type discrimination | Transcriptomics, skin biopsies, clinical data | Multi-island Adaptive Genetic Algorithm, Principal Component Analysis
Diabetes [261] | Prediction of disease progression | Physiological, biochemical, and sequencing data | Decision Trees, Logistic Regression, Linear Discriminant Analysis, K-Nearest Neighbors Classifier, Gaussian Naïve Bayes, and Support Vector Machine
Huntington’s disease [538] | Early detection | Structural and functional MRI, diffusion-weighted imaging scans | Linear Discriminant Analysis, Support Vector Machine
Lung cancer [734] | Early detection | Plasma metabolomic data | AdaBoost, K-Nearest Neighbor, Naïve Bayes, Neural Network, Random Forest, Support Vector Machine
Prostate cancer [284] | Screening and diagnosis | Microarray data, cancer tissue | Artificial Neural Network
Prostate cancer [139] | Diagnosis and prognosis | Proteomic data | Random Forest, Brute Force
Sepsis [454] | Early detection | Physiological parameters, blood values, vitals, monitoring data, demographics | Gaussian Process Temporal Convolutional Networks and Dynamic Time Warping
Traumatic brain injury [448] | Diagnosis | Structural MRI data, clinical data, demographics | Principal Component Analysis, Random Forest
Tuberculosis [363] | Disease detection | Chest X-ray images, radiology reports, clinical data | Convolutional Neural Networks

At the core of biomarker discovery is the search for markers that can discriminate
between samples or clinical characteristics of diseased patients and those of healthy
subjects. Biomarker discovery is equivalent to feature selection in machine learning
[595]. Feature selection algorithms are intended to reduce the dimensionality of the
feature space by removing non-informative and redundant features, while retaining
the informative features [595]. In general, feature selection algorithms can be used
to (i) modify representations of data by extracting informative variables (i.e., feature
extraction), (ii) create probabilistic models of disease progression, and (iii) determine
what specific piece of (unknown) information, for instance laboratory tests, will be
most valuable in optimizing the predictive ability of a model. Broadly speaking, there
are two major strategies of feature selection. The first is univariate feature selection,
which investigates each feature one by one to determine the strength of its relationship
with the outcome variable. Popular univariate feature selection methods include linear
mixed models [416], support vector machines [764], and kernel-based measures [142].
Variants of these models tackle the challenging problem of feature selection from time
series data [91, 92, 320] and can account for covariates or confounders [23, 419]. The
second major strategy is multivariate feature selection, which considers whole groups
of features together. Common multivariate prediction models are Lasso regression models,
tree ensembles (e.g., gradient boosting trees) [315], support vector machines,
Gaussian processes with kernels for comparing time series [430], and neural networks from
deep learning (long short-term memory [277], gated recurrent units [135], temporal
convolutional networks [38], and attention models [702]). Several of these techniques have
been successfully applied to clinical outcome prediction over recent years, especially
in the area of intensive care research [216, 293, 454, 455].
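
The difference between the two feature selection strategies can be illustrated with a minimal Python sketch on synthetic data. Everything in it is an assumption made for illustration: the three invented features (`marker`, `redundant`, `noise`), the deliberately noise-free outcome, the use of absolute Pearson correlation as the univariate score, and a small Lasso fitted by coordinate descent as the multivariate method. It is not the procedure of any of the cited studies.

```python
import random
from math import sqrt


def standardize(column):
    """Rescale a column to zero mean and unit (population) standard deviation."""
    n = len(column)
    mean = sum(column) / n
    sd = sqrt(sum((v - mean) ** 2 for v in column) / n)
    return [(v - mean) / sd for v in column]


def univariate_scores(columns, outcome):
    """Score each feature on its own: absolute Pearson correlation with the outcome."""
    n = len(outcome)
    y = standardize(outcome)
    return [abs(sum(a * b for a, b in zip(standardize(col), y)) / n) for col in columns]


def soft_threshold(value, lam):
    """Shrink value towards zero by lam; values within [-lam, lam] become exactly 0."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0


def lasso_coefficients(columns, outcome, lam, sweeps=50):
    """Multivariate selection: Lasso fitted by cyclic coordinate descent."""
    n = len(outcome)
    cols = [standardize(col) for col in columns]
    y_mean = sum(outcome) / n
    residual = [v - y_mean for v in outcome]  # residual while all coefficients are 0
    beta = [0.0] * len(cols)
    for _ in range(sweeps):
        for j, col in enumerate(cols):
            # Covariance of feature j with the residual that ignores feature j itself.
            rho = sum(x * (r + beta[j] * x) for x, r in zip(col, residual)) / n
            new_beta = soft_threshold(rho, lam)  # standardized column: mean square is 1
            delta = new_beta - beta[j]
            if delta != 0.0:
                residual = [r - delta * x for x, r in zip(col, residual)]
                beta[j] = new_beta
    return beta


# Synthetic toy data (invented): one informative marker, one noisy copy of it,
# and one unrelated noise feature. The outcome depends on the marker only.
rng = random.Random(0)
n_samples = 200
marker = [rng.gauss(0, 1) for _ in range(n_samples)]
redundant = [m + 0.5 * rng.gauss(0, 1) for m in marker]
noise = [rng.gauss(0, 1) for _ in range(n_samples)]
outcome = [2.0 * m for m in marker]

columns = [marker, redundant, noise]
scores = univariate_scores(columns, outcome)
beta = lasso_coefficients(columns, outcome, lam=0.1)
print("univariate scores:", [round(s, 2) for s in scores])
print("lasso coefficients:", [round(b, 2) for b in beta])
```

In this toy setting, the univariate scores are high for both the informative marker and its noisy copy, because each feature is judged in isolation. The Lasso considers the features jointly and sets the coefficients of the redundant and the noise feature to exactly zero, which is the behavior described above as removing non-informative and redundant features.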

The ongoing coronavirus disease (COVID-19) pandemic has been a shining example
of how machine learning can support and even guide biomarker discovery. At the
beginning of the pandemic, there were no recommendations or guidelines in place
on how to manage COVID-19 patients or identify patients at risk. As a consequence,
physicians in emergency and intensive care units were forced to improvise on an individual
patient level and administer off-label treatments. This was particularly difficult
for patients who appeared to be on a disease trajectory towards recovery but suddenly
deteriorated at a speed that did not allow for timely targeted management. COVID-19
has been associated with a high ‘failure-to-rescue’ rate (i.e., the rate of death among
patients following a complication) [618]. Further complicating COVID-19 disease management
was the limited understanding of the different clinical phenotypes associated
with COVID-19 [672] and of how forthcoming mutations of SARS-CoV-2 would modify the
clinical manifestation and/or responses to current off-label treatments. In combatting
the COVID-19 pandemic, massive global efforts have been undertaken to determine 
factors that are associated with the clinical presentation of the disease and its progres- 
sion [88, 205, 575], in-hospital mortality risk [428, 742], and treatment response [396]. 
With the availability of COVID-19-related clinical, imaging, and multi-omics data, clin- 
ically relevant biomarker signatures can be determined by means of computational 
modeling [428, 430, 590]. A recent study leveraged electronic clinical trial data from 69 
hospitals to develop a risk-scoring system for assessing COVID-19 related in-hospital 
mortality risk [590]. A wide range of biomarkers (age, pre-existing cardiovascular issues, 
blood markers) were found to be significantly associated with mortality outcomes. 

In conclusion, machine learning-driven applications can be found across various 
medical disciplines and bear great potential to advance health care as a whole. Never- 
theless, it is important to mention that there are many challenges (e.g., data privacy, 
quality of the data, generalizability of the models) that have to be tackled on the road
to clinical implementation.


2.1.5 Biomedical Data Management 


As data collection and volume surge, machine learning has emerged as a key player in
data management, easing the burden of querying data sources, as well as the curation
and governance of data. In the context of healthcare, machine learning-guided data 
management ranges from genome assembly to managing national electronic health 
record systems. In general, data management is a labor- and time-intensive task that 
often involves repetitive steps, which can be (partially) automated by means of machine 
learning (Table 2.3). Machine learning algorithms pursue three major data management
goals: automation of time-consuming and iterative development tasks (cataloging data,
mapping sources to targets, data preprocessing), optimization of system performance 
(table-join approaches), and capacity management (workload-aware autoscaling). 

For instance, preparing and cleaning the raw data and making it suitable for sub- 
sequent analysis is an integral part of creating any statistical or machine learning 
model. Data preprocessing entails different steps: dataset requests, data fusion, and
data anomaly detection. Defining the quality of the data is an important step as it 
will directly impact the performance and reliability of machine learning algorithms. 
Anomaly detection aims to identify observations or data elements that raise suspicions 
as they significantly deviate from the majority of the data. Anomalous data can be 
indicative of data-quality issues, nonstandard data, or outliers. Machine learning
algorithms have the potential to automatically detect and remediate data-quality is- 
sues [138]. Specific examples of machine learning applied to anomaly detection and 
data cleaning are clustering [641], classification [707, 739], autoregression [758], and 
Bayesian statistics [162]. 
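
As a minimal illustration of anomaly detection for data cleaning, the following sketch flags values that deviate strongly from the majority of a single numeric variable. It uses the modified z-score based on the median absolute deviation, a simple statistical rule chosen here for brevity and not one of the cited methods. The function name, the threshold of 3.5, and the heart-rate values are assumptions made for this example.

```python
from statistics import median


def mad_anomalies(values, threshold=3.5):
    """Return the indices of values whose modified z-score exceeds `threshold`."""
    med = median(values)
    mad = median(abs(v - med) for v in values)  # median absolute deviation
    if mad == 0:
        return []  # no spread around the median, so nothing can be ranked
    # 0.6745 rescales the MAD so that the score is comparable to a z-score
    # for normally distributed data.
    return [i for i, v in enumerate(values)
            if 0.6745 * abs(v - med) / mad > threshold]


# Hypothetical heart-rate readings (beats per minute); 720 mimics a typing error.
heart_rates = [72, 75, 71, 69, 74, 73, 70, 720, 76, 68]
flagged = mad_anomalies(heart_rates)  # index 7 is flagged
```

Whether a flagged value is a data-entry error or a clinically meaningful observation still has to be decided with domain knowledge; the rule only ranks candidates for review.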


Tab. 2.3: Data management tasks that can be optimized by machine learning algorithms.


Task                                    Explanation

Data cataloging and curation            To override manual data labeling by using automatic
                                        labeling [202]

Data preprocessing, anomaly detection   To identify missing data, help identify and fix
                                        incorrect labels [620, 641, 758]; to identify
                                        observations or data elements that raise suspicions
                                        because they significantly deviate from the majority
                                        of the data [641]

Data mapping                            To match fields from multiple datasets into a
                                        manageable and harmonized system [66]

Feature engineering                     To create candidate features out of a dataset from
                                        which the best can be selected and used for
                                        training [221]


2.1.6 Challenges for Machine Learning Methods in the Context of Health Care 


Computational innovations are bound to transform health care. Nevertheless, there are 
a number of challenges that have to be tackled in order to successfully adopt machine 
learning in the clinical setting. The challenges concern data quality (e.g., missing data, 
outliers), learning while preserving privacy, the interpretability and generalizability of 
machine learning models, and clinical implementation. In the following section, we
will elaborate on some of these challenges and potential mitigation strategies.


Missing Data One of the most common machine learning challenges faced is the 
occurrence of missing data in digital health datasets. There is a multitude of reasons 
why missing data occur, ranging from (human) data-entry error, missing measurements, 
dropouts in clinical studies, and merging unrelated data, to software errors in the 
data processing pipeline [285, 646, 663]. Missing data can have a significant effect 
on the data quality, lead to application performance degradation, cause analytical 
issues, and bias outcomes. In the context of medicine, the latter has been previously 
associated with misdiagnosis, wrong treatment decisions, and even discrimination 
of marginalized groups [656, 744]. Moreover, most state-of-the-art machine learning 
models require complete input variables. Missing data is typically handled by either 
the deletion of all data for an observation that has one or more missing values or the 
replacement with estimated values (i.e., imputation) [177]. A variety of methods have 
been developed that can account for different levels of sparsity in the data as well as 
efficiently handle missing information (Figure 2.1). Popular machine learning algo- 
rithms include k-nearest neighbors [50], multi-task Gaussian processes [729], random 
forest-based approaches [647, 670], matrix factorization [341, 692], discriminative deep 
learning methods [64], and generative deep learning methods [469, 596, 749]. Another 
elegant strategy for handling missing data is the use of end-to-end models that impute
and predict jointly, such as the Gaussian process adapter [393] and interpolation-prediction
networks [610], and models that do not require imputation and can act on irregular 
data directly, including attention models and gated recurrent unit-decay [702, 725]. For 
a comprehensive review on the problem of missing values and strategies for handling 
missing data, see [188]. 
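
To make the difference between simple and neighbor-based imputation concrete, the sketch below fills one missing value in two ways: with the column mean and with the mean over the k nearest complete records. It is a simplified illustration, not the implementation of any cited method. The function names, the use of `None` for missing entries, and the toy records are assumptions, and features would normally be scaled before distances are computed.

```python
import math


def mean_impute(rows):
    """Replace each missing value (None) with the mean of its column."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(sum(observed) / len(observed))
    return [[means[j] if r[j] is None else r[j] for j in range(n_cols)]
            for r in rows]


def _distance(a, b, columns):
    """Euclidean distance restricted to the given columns."""
    return math.sqrt(sum((a[j] - b[j]) ** 2 for j in columns))


def knn_impute(rows, k=2):
    """Replace each missing value with the column mean of the k nearest complete rows."""
    complete = [r for r in rows if None not in r]
    imputed = []
    for r in rows:
        if None not in r:
            imputed.append(list(r))
            continue
        # Distances use only the features that are observed in this record.
        observed = [j for j, v in enumerate(r) if v is not None]
        neighbours = sorted(complete, key=lambda c: _distance(r, c, observed))[:k]
        imputed.append([
            sum(c[j] for c in neighbours) / len(neighbours) if v is None else v
            for j, v in enumerate(r)
        ])
    return imputed


# Hypothetical records with two features; the last record lacks the second one.
records = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [9.0, 90.0], [2.5, None]]
by_mean = mean_impute(records)[-1]      # [2.5, 37.5]
by_knn = knn_impute(records, k=2)[-1]   # [2.5, 25.0]
```

In this toy example the column mean is pulled upward by the one extreme record, whereas the neighbor-based estimate follows the records most similar to the incomplete one.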


[Figure 2.1: tree diagram of methods for handling missing data]
Deletion: listwise deletion; pairwise deletion; deleting columns
Imputation, analysis of the variable with the missing data: mean, median, mode;
    regression (linear, logistic); multiple imputation; hot deck; linear interpolation;
    random sample imputation
Imputation, machine learning based: k-nearest neighbour; neural network;
    multilayer perceptron; self-organizing maps
Strategies that take into account data distribution, model-based likelihood:
    maximum likelihood with expectation-maximization; Bayesian methods

Fig. 2.1: Strategies for handling missing data.


Outlier Detection Another noteworthy challenge is how to detect and handle outliers
in a dataset. Outliers are defined as observations that raise suspicion because
they deviate markedly from other observations in the given dataset. Common causes
of outliers in digital health datasets include measurement error,
data entry error, sampling error, and natural outliers. A special category is
intentional outliers, which are dummy outliers created to assess the efficiency of
detection methods. It is important to note that outliers are inherently different from
noise, which is commonly defined as a random error or variance. The outlier is part of
the data and can even carry (clinically) important information, while noise is simply
a random error (e.g., mislabeled data, missing data). Detecting outliers in a dataset
is a highly relevant task as outliers can impact the distribution of the data, increase
the error variance, reduce the power of statistical tests, introduce bias, influence
estimates, and impact key assumptions of statistical tests. A multitude of statistical and
machine learning methods exist for the task of outlier detection, including k-nearest
neighbor [290], linear regression, naive Bayes [541], decision trees [620], and support 
vector machines [764]. The classic distance-based methods are empirically highly
successful [104, 330, 759]. For instance, a patient may be deemed an outlier if they
are distant from other patients in the dataset. In a sampling-based variant, a patient is
deemed an outlier if they are distant from a randomly drawn subset of historic patients
(Figure 2.2). This scheme is
extremely scalable, as it requires only the computation of distances between the patient 
and the small subset of historic patients, even for large clinical data warehouses. The 
size of the sample can even be explicitly optimized. 
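
The sampling-based scheme can be sketched in a few lines. The abnormality score below is the distance from a new patient to the nearest member of a random subset of historic patients, in line with the description above and with Figure 2.2. The two-dimensional "embeddings", the sample size of 20, and all names are assumptions made for illustration only.

```python
import math
import random


def sampled_outlier_score(patient, history, sample_size, rng):
    """Distance from `patient` to the nearest member of a random subset of `history`."""
    sample = rng.sample(history, min(sample_size, len(history)))
    return min(math.dist(patient, h) for h in sample)


rng = random.Random(42)

# Hypothetical patient embeddings: 1000 historic patients scattered around the origin.
history = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(1000)]

typical_patient = (0.1, -0.2)
unusual_patient = (8.0, 8.0)

# Only 20 distances are computed per query, regardless of the size of the warehouse.
typical_score = sampled_outlier_score(typical_patient, history, 20, rng)
unusual_score = sampled_outlier_score(unusual_patient, history, 20, rng)
```

The cost per query depends on the sample size rather than on the number of stored patients, which is what makes the scheme attractive for large clinical data warehouses. A decision threshold on the score would still have to be chosen, for example from the scores of known typical patients.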


Learning while Preserving Privacy Data privacy has become one of the most im- 
portant issues of our time. A breach of personal information can infringe fundamental 
rights and freedoms of an individual, including the risk of being identified and disclo- 
sure of personal (health) data. Data privacy breaches, such as the Facebook Cambridge 
Analytica scandal [287], have made patients and their caregivers reluctant to share 
sensitive and personal information. In response, data-privacy concerns have taken 
center stage and countries around the world have implemented legislation, such as the 
European Union’s General Data Protection Regulation (GDPR) [706] and the California 
Consumer Privacy Act (CCPA) [488]. Medical questions that are tackled by a data-driven 
approach often require access to sensitive, personal information. 

Fig. 2.2: Illustration of the goal of abnormality detection in COVID-19 patients (left panel). The
electronic health record of a new patient P5 is checked for being abnormal relative to all patients in
the clinical data warehouse (here patients P1-P4) (center panel). Relevant features are extracted to
compare the similarity of the new patient to previous patients. The features describing a patient are
referred to as patient embeddings (right panel). Based on the (dis)similarity of the new patient to
previous patients, an abnormality score is computed. In this example, the abnormality score is the
distance to the most similar patient in the database, which is largest for P5 (0.97).


Major efforts have been undertaken to develop privacy-preserving machine learning
algorithms that keep patients' details secure without compromising the model’s
performance [533, 719, 736]. Federated learning [394, 745] addresses some of these
concerns through training algorithms collaboratively without requiring exchange of
raw data. In federated learning, the model parameters are handled centrally. To over- 
come this “concentration of power”, swarm learning was recently introduced [719]. The 
principle of swarm learning is to build the machine learning models independently 
on data from individual sites (e.g., hospitals) and share the model parameters via a 
so-called swarm network. With this approach, swarm learning secures data sovereignty 
while preserving privacy and confidentiality. Data mining is another popular discipline 
in medical data science concerned with preservation of privacy [12, 18]. Also known as 
knowledge discovery in data, data mining is the process of automatically uncovering 
novel patterns and trends in big data that would otherwise remain hidden [137]. In
recent years, data mining has been successfully applied in a variety of medical
disciplines: detection of diabetes [36], cancer prognosis prediction [572], biomarker
discovery [244], sepsis [91], and prediction of stroke mortality [186]. As data privacy
should be preserved at all costs, numerous privacy-preserving data-mining methods
have been developed. These include randomization [183], classification [130],
clustering [229], association rules [539], k-anonymity [134], l-diversity [425], distributed
privacy preservation [182], condensation [10], and cryptographic techniques [376, 501].
A comprehensive survey of privacy-preserving data-mining techniques can be found
in [16, 567].
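
The idea of federated learning, training collaboratively while the raw data stay at each site, can be illustrated with a minimal federated-averaging sketch. It assumes a one-parameter linear model y = w * x and two hypothetical hospitals; the learning rate, the number of rounds, and all names are illustrative choices and do not reproduce the cited systems.

```python
def local_update(w, data, lr=0.01, epochs=5):
    """A few epochs of gradient descent on one site's private data for y = w * x."""
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


def federated_average(site_datasets, rounds=50):
    """A central server averages locally trained weights, weighted by sample count."""
    w_global = 0.0
    for _ in range(rounds):
        # Each site starts from the current global weight and trains locally.
        updates = [(local_update(w_global, data), len(data)) for data in site_datasets]
        total = sum(n for _, n in updates)
        # Only the weights and the sample counts leave the sites, never the records.
        w_global = sum(w * n for w, n in updates) / total
    return w_global


# Hypothetical private datasets of two hospitals, both following y = 2 * x.
hospital_a = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
hospital_b = [(4.0, 8.0), (5.0, 10.0)]

w = federated_average([hospital_a, hospital_b])  # converges towards 2.0
```

The central averaging step is the part that swarm learning replaces with parameter sharing over a peer-to-peer network. Note that exchanged parameters can still leak information about the training data, so such schemes are often combined with additional safeguards.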

Access to data generated within healthcare systems is often restricted due to
privacy and confidentiality concerns. A strategy to make sensitive health data available
is to de-identify or anonymize the data by deleting or encoding identifiers that link
to individuals (e.g., names and patient identifiers), by perturbing the data (e.g., applying
round-numbering methods and adding random noise), by swapping data (e.g., shuffling
dataset attribute values), or by generalizing information (e.g., grouping variables).
Even with these de-identification efforts, it remains extremely difficult to guarantee that
the re-identification of individual patients is not possible. A promising mitigation
strategy is to generate synthetic, representative data that can be safely shared. Synthetic
data can be used both to augment datasets and to generate artificial but realistic patient
data that can be shared, even across geographical and political borders. For this form
of synthetic data generation, Generative Adversarial Networks (GANs) [235] are well 
suited. Briefly, GANs are a type of deep learning model that consists of two networks,
one called the generator and the other called the discriminator. These two networks are
trained simultaneously and competitively, as in a zero-sum game. The generator
learns how to map from a latent space to a data distribution of interest, i.e., to generate
candidate synthetic patients, and the discriminator evaluates the candidates by
distinguishing them from samples of the true distribution. In the context of clinical data,
medical GAN (medGAN) [117] is a recent approach that can generate high-dimensional discrete
variables via a combination of an autoencoder and a GAN. Furthermore,
medical Wasserstein GAN (medWGAN) and medical boundary-seeking GAN (medBGAN) improved
the performance of medGAN in generating synthetic data from the “Medical Information
Mart for Intensive Care” database and the Taiwan National Health Insurance Research
Database [144]. Instead of a general GAN, medWGAN uses an improved generative
network named WGAN-GP, which replaces the weight-clipping technique, a cause of
failure to converge in some settings, with a gradient penalty [43]. Finally, medBGAN
modifies GAN training so that the generator creates new samples that lie
on the decision boundary of the discriminator at each update [43].
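
Two of the de-identification operations mentioned earlier in this section, deleting identifiers and generalizing information, can be sketched together with a check of the resulting k-anonymity. The record layout, the ten-year age bands, the three-digit postal code prefix, and all names are assumptions made for this example.

```python
from collections import Counter


def deidentify(record, band=10):
    """Drop the direct identifier and generalize age and postal code."""
    low = (record["age"] // band) * band
    return {
        "age_band": f"{low}-{low + band - 1}",  # e.g. 34 -> "30-39"
        "zip3": record["zip"][:3],              # keep only the first three digits
        "diagnosis": record["diagnosis"],
    }


def smallest_group(records, quasi_identifiers):
    """Size of the smallest group of records sharing all quasi-identifier values.

    A release is k-anonymous with respect to these attributes for this k.
    """
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())


# Hypothetical patient records.
patients = [
    {"name": "A", "age": 34, "zip": "44135", "diagnosis": "asthma"},
    {"name": "B", "age": 38, "zip": "44139", "diagnosis": "diabetes"},
    {"name": "C", "age": 52, "zip": "44227", "diagnosis": "asthma"},
    {"name": "D", "age": 57, "zip": "44225", "diagnosis": "covid-19"},
]

released = [deidentify(p) for p in patients]
k_before = smallest_group(patients, ["age", "zip"])        # 1: every patient is unique
k_after = smallest_group(released, ["age_band", "zip3"])   # 2
```

Even a k-anonymous release does not rule out re-identification, for instance when all members of a group share the same diagnosis, which is one motivation for the synthetic-data approaches described above.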


Interpretability and Generalizability of Models In addition to being applied in 
complex high-stakes settings such as medicine, machine learning algorithms are also 
becoming increasingly complex in terms of their architecture. At times, these algorithms 
become so complex that humans can no longer comprehend the underlying models
or how variables are jointly related to make predictions. Modern machine learning 
algorithms are commonly referred to as “black boxes”. The alleged black box nature 
constitutes a major barrier to the adoption of machine learning in the clinical routine. 
But what is the reason for not trusting a machine learning model that has been proven 
to perform well and can accurately diagnose patients? “The problem is that a single 
metric, such as classification accuracy, is an incomplete description of most real-world 
tasks” [179]. The interpretability of machine learning models is critical to understand the
accuracy of findings, identify variables that drive the predictions, improve model
performance, guard against embedded bias, and debug models. Medicine is among the
domains where scientists are often compelled to implement simpler and interpretable
machine learning models (e.g., linear models or decision trees) as every decision
taken by the model has to be interpretable. Clinical staff need to understand why a
machine learning algorithm generates the results it does (i.e., interpretability) and
ideally how it arrives at its conclusions (i.e., explainability) [190]. Better interpretability
might come at the expense of model performance. As biological associations are hardly 
ever of a linear nature, complex models, including ensembles and neural networks, 
typically achieve more accurate predictions. Incomplete interpretability is often
compensated for with judgement, knowledge provided by domain experts (e.g., clinicians),
rigorous monitoring, and a diligent understanding of the data used. The development
of methods to enhance the interpretability of machine learning models is an active area
of research. Numerous model-agnostic interpretation tools
exist that can be applied to any supervised machine learning model [529]. There are
two major categories of model-agnostic methods: local methods, which describe
individual predictions, and global methods, which explain how features
affect the prediction as a whole. Table 2.4 provides an overview of common local and
global model-agnostic methods. These model-agnostic interpretability methods allow 
researchers to interpret the results of (complex) machine learning models and can pave 
the way for the implementation of such models in the clinical routine. 


Tab. 2.4: Common local and global model-agnostic interpretation methods of machine learning
models. For a comprehensive overview on interpretable machine learning, see [453].


Local methods                                     Global methods

Individual Conditional Expectation (ICE) [233]    Partial Dependence Plot (PDP) [213]
Local Surrogate (LIME) [527]                      Accumulated Local Effects (ALE) Plot [25]
Counterfactual Explanations [711]                 Feature Interaction [214]
Scoped Rules (Anchors) [528]                      Functional Decomposition [281]
Shapley Values [661]                              Permutation Feature Importance [102]
SHAP (SHapley Additive exPlanations) [661]        Global Surrogate [152]
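
As an example of a global model-agnostic method from Table 2.4, permutation feature importance can be sketched as follows: one feature column is shuffled, and the resulting increase in prediction error indicates how much the model relies on that feature. The stand-in model, the error measure, and the synthetic data are assumptions for illustration; the procedure itself only needs a function that maps a record to a prediction.

```python
import random


def mean_abs_error(model, X, y):
    """Mean absolute error of `model` on records X with targets y."""
    return sum(abs(model(row) - target) for row, target in zip(X, y)) / len(y)


def permutation_importance(model, X, y, feature, rng, repeats=10):
    """Average increase in error after shuffling the values of one feature."""
    baseline = mean_abs_error(model, X, y)
    increases = []
    for _ in range(repeats):
        column = [row[feature] for row in X]
        rng.shuffle(column)  # breaks the link between this feature and the target
        X_perm = [row[:feature] + [value] + row[feature + 1:]
                  for row, value in zip(X, column)]
        increases.append(mean_abs_error(model, X_perm, y) - baseline)
    return sum(increases) / repeats


rng = random.Random(0)

# Synthetic data: the target depends on feature 0 only; feature 1 is irrelevant.
X = [[rng.random(), rng.random()] for _ in range(200)]
y = [3.0 * row[0] for row in X]


def model(row):
    """Stand-in for a trained black-box model that uses feature 0 only."""
    return 3.0 * row[0]


importance_0 = permutation_importance(model, X, y, 0, rng)  # clearly positive
importance_1 = permutation_importance(model, X, y, 1, rng)  # exactly 0.0
```

Because the method treats the model as a function, the same code applies to an ensemble or a neural network. Strongly correlated features can distort the result, since shuffling one of them creates unrealistic records.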


Apart from interpretability, generalizability is a desired attribute of machine learning
algorithms. Generalizability refers to the ability of a trained algorithm to perform well
on unseen data. One particularly elusive challenge concerns generalizability across different
patient groups [489]. There are numerous examples of promising machine learning
applications that struggled when applied to diverse populations. Google, for instance, 
introduced a machine learning algorithm for the diagnosis of diabetic retinopathy that 
performed poorly in India [7]. This is likely attributable to the fact that the algorithm 
was developed, trained, and evaluated on a dataset that lacked the necessary ethno- 
racial and demographic diversity. Another example of unintended effects of artificial 
intelligence is an algorithm that was developed to detect skin cancer on images of skin.
While the algorithm performed well on fair skin, it was not able to reliably diagnose
lesions on darker skin [240].

These examples highlight that data from diverse groups (i.e., in terms
of sex, ethnic background, and race) are fundamental to realizing universal precision
medicine. The reality, however, is that data is often derived from a worryingly small 
and homogeneous sample of the population (e.g., white male individuals). As a result, 
ethnoracial disparities are evident in many patient populations and include differences 
in access to care, time to diagnosis, treatment, and mortality [545, 566, 748]. In order to 
eliminate bias and create fairness and equity, scientists have to be conscious of bias that 
can occur at different levels, namely data collection and selection, model development 
and evaluation, as well as model deployment and clinical implementation. Data is the
driving force behind any machine learning and artificial intelligence algorithm. That
is why the data underlying the development and evaluation of algorithms must be
unbiased and representative of the target populations to avoid generating or perpetu- 
ating biases that may worsen patient outcomes. Often bias is rooted in systematically 
skewed data collection, e.g., through clinical trials predominantly carried out with 
white male participants, or the reliance on historical data that might have been subject 
to biased data generation or clinical practices. To mitigate bias in the data, diverse and
well-balanced study populations are crucial for any collection and/or selection of data
(e.g., clinical trials, registries, electronic health records). Particular attention should be
paid to ethnoracial diversity, sex/gender balance, socioeconomic equity, and other
social and ethical determinants of disease and access to healthcare. Assuming that
the available data is unbiased, researchers have to carefully select the data variables to 
avoid introducing a bias in the phase of algorithm development. 

Where possible, algorithms should be tested on different patient populations for both
scientific and ethical performance. Ideally, the development, evaluation, and clinical 
implementation of algorithms is done in liaison with clinicians to ensure that the 
algorithms do not exhibit bias in the clinical setting. Lastly, the healthcare system, in 
which machine learning tools are implemented, is an important entry point of bias. 
Awareness of inherent biases of machine learning-assisted tools among healthcare
professionals is pivotal to mitigate bias. This starts by ensuring equitable patient access 
to the technology, and then diligently observing how it performs in diverse populations 
and underrepresented communities. Any bias noticed should immediately be reported 
to an appropriate committee at the hospital, which can then communicate with the 
developers. Understanding bias inherent in medical technology allows clinicians to 
question the accuracy of the technology if the results do not meet the expectations 
from their clinical expertise. 
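
Testing an algorithm on different patient populations can start with something as simple as reporting its performance per subgroup instead of a single overall figure. The sketch below assumes binary labels and a hypothetical grouping attribute; the labels, predictions, and group names are invented for illustration.

```python
from collections import defaultdict


def accuracy_by_group(y_true, y_pred, groups):
    """Accuracy computed separately for each patient subgroup."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, prediction, group in zip(y_true, y_pred, groups):
        total[group] += 1
        correct[group] += int(truth == prediction)
    return {group: correct[group] / total[group] for group in total}


# Hypothetical diagnoses (1 = disease present) for two patient groups.
y_true = [1, 0, 1, 1, 0, 1, 0, 1]
y_pred = [1, 0, 1, 1, 0, 0, 1, 1]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

overall = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)  # 0.75
per_group = accuracy_by_group(y_true, y_pred, groups)  # {"A": 1.0, "B": 0.5}
gap = max(per_group.values()) - min(per_group.values())  # 0.5
```

In this toy example an overall accuracy of 0.75 hides the fact that the model is no better than chance for group B. In practice, clinically relevant measures such as sensitivity and specificity would be compared in the same way.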


Clinical Implementation and Validation Considerable technological progress has 
been made over the last decades, with machine learning applications transforming
clinical decision-making and how health resources (e.g., data) are managed. Nevertheless,
the greatest challenge of machine learning applications is not the medical utility,
but rather the implementation in the clinical routine. While these developments are
crucial to advance health care, they raise a number of ethical and legal concerns
that machine learning and artificial intelligence might harm patients and/or clinicians.
Technological and diagnostic failures can lead to adverse events or seriously harm 
patients. It is not yet clear who is liable for malfunctions in, or erroneous decisions 
made by, artificial intelligence-based clinical tools that result in inaccurate or delayed
diagnosis [510]. The attribution of accountability becomes even more complicated when 
an artificial intelligence-based clinical tool gives a wrong treatment recommendation, 
yet the clinician makes the final decision. Is the clinician liable or can the liability be 
delegated to a company or person that engineered the tool? In [510], Price et al. provide 
an overview of potential scenarios and associated probable legal outcomes related to AI 
use in clinical practice. Under the current law, clinicians are shielded from liability as 
long as they do not deviate from the standard of care [532]. As a consequence, clinicians 
are advised to utilize machine learning-guided tools to support and confirm existing 
decision-making processes as opposed to solely relying on computational algorithm 
output for diagnosis or treatment selection [441]. The complexity further increases 
when considering that there are multiple stakeholders in the ecosystem of liability, 
including the healthcare institutions that purchase and implement computational 
algorithms. 


2.1.7 Future Directions of Machine Learning in Medicine 


The accelerating generation of unparalleled amounts of health data will lead to funda- 
mental changes in medicine and health care. Machine learning applications are poised 
to play an increasingly prominent role in medicine. Specifically, they will facilitate 
early disease recognition, refine diagnosis and prognosis, support therapy decisions, 
and streamline biomedical data management. Importantly, machine learning-guided
systems will not replace clinicians or therapists, but will augment their efforts and time 
to care for patients thanks to guidance for clinical decision-making and the automation 
of time-consuming and repetitive tasks. As of yet, many barriers exist to the adoption 
of such applications in the clinical routine. In the coming years, the data explosion will 
continue and reach unparalleled dimensions, including the number of patients (e.g., 
UK Biobank), the length of time series (e.g., ICU monitoring data, wearable devices), 
and the breadth of data types (from molecular to higher-level phenotypes). This abundance
of rich datasets will open new avenues for machine learning-driven applications to 
assist in clinical decision-making. In addition to refining clinical processes, machine 
learning and artificial intelligence will also play a pivotal role in other important areas 
of biomedical research, such as protein structure prediction, molecule design, drug 
discovery, or single-cell research. A groundbreaking example is the recently introduced 
machine learning approach AlphaFold [308], which performs predictions of protein 
structure with unprecedented accuracy by incorporating physical and biological
knowledge. Besides the technological advances and the availability of vast amounts of data,
overcoming the challenges of handling missing data and outliers, preserving privacy, 
interpretability, and clinical application will be critical for the adoption of machine 
learning-supported guidance tools in clinical routine.


Abstract: Amid the accelerating spread of viral diseases throughout the world, the
rapid detection of pathogens is of essential importance. Viruses can be transmitted very 
quickly via contacts in public transportation or at large social events. Therefore, a rapid 
virus test system for such crowded locations is highly desirable. The plasmon-assisted 
microscopy of nano-objects (PAMONO) sensor is one such analytical instrument. The 
sensor required the development of software and optomechanical parts to detect viruses 
in complex biological liquids. While the focus lies on viral particles, other nanoparticles 
can also be analyzed by employing a similar principle. This vastly expands
the potential application field of the PAMONO sensor. The developed methods are 
tailored to the spatiotemporal characteristics of the underlying sensor system, making 
use of the adaptivity of machine learning approaches. As a result, 80 nm to 300 nm 
particles can be detected in signals with different types of imaging artifacts and different 
resolutions, reaching accuracies of over 80 % with respect to the expected particle 
counts of test samples. For mobile use as a rapid test system, resource-saving and 
real-time capability are of similar importance to make the device accessible in as many 
application areas as possible. Multi-objective optimization in terms of detection quality 
and energy consumption was applied to demonstrate that the usage as a mobile system 
is feasible. 


2.2.1 Introduction 


For the detection of nanoparticles such as virus particles, a device that can make them
perceptible is required. Such a device is the PAMONO sensor [385, 607], which pumps a 
liquid or air sample through a flow cell to reveal the particles of interest contained in it. 
This is achieved by making use of what is known as the Surface Plasmon Resonance 
(SPR) phenomenon [340]. 

Special proteins—antibodies—immobilized on a gold sensor film help to accomplish 
the specificity of viral particle detection. Particles that bind to the antibodies cause 
local changes in the reflection conditions near the surface of the film. Due to locally 
increased reflection, particle binding events become detectable by a charge-coupled 
device (CCD) [389] integrated into the PAMONO sensor.

The limits of manual analysis are reached quickly when trying to identify particle
signals. Even after enhancing the visibility of particles in the recorded images by pre- 
processing, finding particle regions is a tedious and time-consuming task for a human 
observer. Test analyses determined that it takes an expert approximately two days to 
accurately analyze a dataset for an experiment of around 4000 images with varying 
results for different particle sizes, different levels of disturbances, and different human 
observers [613]. This time range, the need for visually trained experts, and the devia- 
tions in the subjective perception of different persons predestine this task to become 
the subject of automated analysis to enable the PAMONO sensor to be used as a rapid 
test. 

To reduce manual interactions and to enable quick testing by non-expert users, 
more adaptive solutions from the field of deep learning were developed to adapt to 
specific signal characteristics while tolerating deviations that inevitably occur between 
different recordings when operating outside a controlled environment. 

While manual interactions are shifted from on-site usage to training time, these 
approaches make actual on-site testing faster. The challenge that arises in return is
the large amount of manually annotated training data needed to learn the patterns of interest.
At the same time, the recording and annotation of new experiments cause material 
and time costs. This is a problem that is worth addressing since it can be observed in 
various tasks of medical data analysis. Dealing with the limited availability of training 
data while leveraging the generalization of machine learning is, therefore, a key aspect 
in this area. 

There are three major challenges to be addressed when detecting nanoparticles 
in samples: dealing with varying artifact characteristics and intensities, real-time ca- 
pability, and the feasibility of mobile usage. Considering the on-site operation, the 
concept of mobility again contains the aspect of resource optimization in terms of 
computing power and energy consumption. We present an approach for multi-objective 
optimization of parameters in the employed algorithms. This optimization can target 
high detection accuracy or low energy consumption. Alternative approaches based 
on deep learning overcome the need for defining specific operators by learning them 
from more general functions. The application under natural conditions
is examined with particular attention. In that context, the analysis of physical
particle sizes is also described: it can provide a
more accurate classification of the contained particles and enables plausibility checks
by taking domain knowledge about the specific types of particles into account.
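
The trade-off behind multi-objective optimization of detection quality and energy consumption can be illustrated by a Pareto filter over candidate parameter configurations: a configuration is kept only if no other one is at least as good in both objectives and strictly better in one. The configuration names and the accuracy and energy values below are invented for illustration and are not measurements from the PAMONO sensor.

```python
def dominates(a, b):
    """True if configuration `a` is at least as good as `b` in both objectives
    (higher accuracy, lower energy) and strictly better in at least one."""
    return (a["accuracy"] >= b["accuracy"] and a["energy_j"] <= b["energy_j"]
            and (a["accuracy"] > b["accuracy"] or a["energy_j"] < b["energy_j"]))


def pareto_front(configs):
    """Configurations that are not dominated by any other configuration."""
    return [c for c in configs if not any(dominates(other, c) for other in configs)]


# Hypothetical parameter configurations of a detection pipeline.
configs = [
    {"name": "A", "accuracy": 0.90, "energy_j": 12.0},
    {"name": "B", "accuracy": 0.85, "energy_j": 6.0},
    {"name": "C", "accuracy": 0.80, "energy_j": 8.0},   # worse than B in both objectives
    {"name": "D", "accuracy": 0.70, "energy_j": 3.0},
    {"name": "E", "accuracy": 0.88, "energy_j": 12.5},  # worse than A in both objectives
]

front_names = [c["name"] for c in pareto_front(configs)]  # ["A", "B", "D"]
```

The remaining configurations represent the available compromises: a stationary laboratory setup might choose the most accurate one, while a battery-powered device might accept lower accuracy for a fraction of the energy.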

Here is a short overview of the sections below: Section 2.2.2 provides details on the 
different types of particles. The setup of the PAMONO sensor is detailed in Section 2.2.3. 
Section 2.2.4 describes the underlying data characteristics and the classic and deep 
learning-based methods for the detection of nanoparticles. Section 2.2.5 presents an 
approach developed for the multi-objective optimization of parameters used in an 
operator. Section 2.2.6 describes aspects of the application in natural environments 
with particular emphasis on determining the size of the analyzed particles and an
approach that increases the robustness of image analyses based on generative adversarial
networks. Finally, Section 2.2.7 provides an outlook on potential approaches to improve 
the hardware and software of the PAMONO sensor. 


2.2.2 Types of Detectable Nanoparticles 


It is first worth noting which types of particles the PAMONO sensor can analyze. Different
physical and biological characteristics can be observed that are attributable to biological
Nanoparticles (bio-NPs). Besides particles of interest such as viruses, Virus-Like
Particles (VLPs) [754], and Extracellular Vesicles (EVs) [504, 505], interfering
objects can also be observed, such as lipid and protein agglomerates, which are often
considered contaminating substances hampering the bio-analytical examination of
samples.

While viruses are well recognized as the smallest infectious agents, containing only
one type of nucleic acid, either Ribonucleic Acid (RNA) or Deoxyribonucleic Acid (DNA) [332],
VLPs and extracellular vesicles have been studied far less. It is important to
highlight the key difference between VLPs and viruses: VLPs do not possess any nucleic
acids and, thus, fundamentally lack the ability to reproduce themselves in the host
organism. On the other hand, VLPs carry the same antigens (molecules considered foreign
by the immune system) on their surface as the corresponding native viruses [754]. Thus, 
VLPs in science can serve as a safe and reliable model of dangerous viruses since VLPs 
efficiently mimic the structural properties of corresponding viruses but cannot replicate. 
In practice, VLPs are well known as commercial medical products serving as a basis for 
vaccines [754]. 

Another group of bio-NPs, Extracellular Vesicles (EVs), has recently started to
attract the attention of scientists and physicians. EVs are submicro- and nano-sized
vesicles released by the majority of cells [505]. A principal feature of EVs is their
ability to carry different molecules inside as well as on their surface. Among such active
molecules are hormones, growth factors, active peptides, and nucleic acids [505]. Their
cargo makes EVs active messengers that participate in intercellular communication and
reflect the cellular status under normal conditions or during pathological processes.
Moreover, the abundance of EVs in body fluids such as blood or saliva has drawn the attention
of clinicians and medical researchers, who harness EVs as a means of drug delivery or
evaluate their potential as biomarkers of the progression of a disease.

However, any analysis of bio-NP samples for scientific and practical needs requires 
the selection of reliable techniques and instruments. Certainly, bio-NPs have to be 
swiftly characterized for their abundance in a sample and their size. From a different 
perspective, biochemical information regarding their surface antigens (proteins) and 
their content is of interest as well. A simultaneous quantification and determination of 
the sizes of bio-NPs can be achieved by the principle of surface plasmon resonance (SPR). 
The plasmon-assisted microscopy of nano-objects (PAMONO) sensor, which exploits 
this principle, is an instrument for label-free and specific detection of individual
bio-NPs in solutions [245, 772]. In the following section, the sensor and the underlying
principle for visualizing particle signals are introduced in more detail. 


2.2.3 PAMONO Sensor 


Surface plasmons can be thought of as propagating electron density waves. These 
waves can be excited by incident light (usually by an incident laser beam) in a thin 
metal film at a dielectric-metal interface. It is precisely this thin metal film that serves 
as a sensor surface. Surface Plasmon Resonance (SPR) is a physical phenomenon that 
served as a basis for the development of the PAMONO sensor. 

Conventional SPR biosensors deal with measurements of the layers of bio- 
molecules formed onto the sensor surface, and thus, conventional SPR sensors are not 
applicable for the detection of individual Nanoparticles (NPs). By contrast, the special 
quality of the PAMONO sensor is exactly the ability to detect the binding of individual 
NPs to the gold sensor surface [772]. Kretschmann’s scheme of plasmon excitation is 
utilized in the PAMONO sensor as well as in the majority of conventional commercial 
SPR-based biosensors [350]. However, there are specific issues that distinguish the 
PAMONO sensor from known conventional SPR biosensors. In Kretschmann’s scheme, 
shown in Figure 2.3, p-polarized light (the electric field is polarized in the
plane of incidence) illuminates a glass prism with a very thin (tens of nanometers)
noble metal film deposited on the base of the prism. Often a superluminescent diode or
a diode laser is used as a source of light [350]. Surface Plasmons (SPs) are excited as
propagating electron density waves at the metal-dielectric interface in the presence of
p-polarized incident light at a particular angle [328, 577, 693]. This event occurs when
the energy of the incident beam is transformed into electron-polaritons within the thin
metal film deposited on the prism. SPs excited along the metal-dielectric interface result
in a substantial reduction of the reflection intensity. In turn, this fact leads to changed
reflection conditions, which are extremely sensitive to any refractive index changes 
occurring close to the metal-dielectric interface [577]. Such changes can be caused by 
the adsorption of molecules onto the metal film surface. SPR-based biosensors harness 
this trait and enable the analysis of interactions between bio-molecules immobilized 
onto the metal film surface and their counterparts in the analyzed liquid sample. Such 
analysis can be performed in real time and without labeling the target molecules. Thus, 
it is not surprising that conventional SPR biosensors are actively used to measure 
binding constants and the kinetics of bio-molecular interactions, and to perform 
concentration measurements [577]. 

The PAMONO sensor also harnesses the most convenient scheme of plasmon ex- 
citation: Kretschmann’s configuration [350]. Figure 2.4 shows a photograph of the 
device setup and the flow cell as the core of the apparatus. The entire device fits into a 
suitcase-sized enclosure. 

Fig. 2.3: Schematic setup of the PAMONO sensor (left), abstract view of a used antibody coating (top
right), preprocessed image of an attached particle clearly visible as a bright elliptic region (bottom 
right, left image), and the average pixel intensities of this area over time (bottom right, right image); 
modified from [669]. 


Moreover, the events preceding the substantial reduction of the reflected light as well 
as the events leading to the restoration of reflection are similar for the PAMONO sensor 
and conventional SPR-based sensors. In the case of the PAMONO sensor, such events 
occur locally, in the spot of NP binding, not on the entire sensor surface as it happens in 
the case of classic SPR sensors [773]. A developed model [773] explains key differences 
in physics between the detection of bio-molecule layer formation (classic SPR sensor) 
and individual NPs (PAMONO sensor). Polystyrene NPs were employed as a model 
system [245]. The use of these particles helped to demonstrate a linear dependency
between the number of signals detected by the PAMONO sensor and the concentration
of particles in liquid samples. In turn, this fact confirmed the applicability of the
PAMONO sensor for concentration measurements of NPs, in which the NP concentration
is expressed as the number of particles per unit volume [245]. The work of Shpacovitch
and colleagues [608] focused on the bio-analytical features of the PAMONO sensor and
demonstrated its ability to detect not only HIV-VLPs (100–140 nm)
but also influenza A viral particles (80–120 nm).

The selectivity studies were performed in phosphate buffered saline (PBS buffer) 
employing specially engineered HIV-VLPs of two types: one containing target protein on 
the surface and one lacking it [608]. Under these conditions, the selectivity of HIV-VLPs 
binding to the PAMONO gold sensor surface reached 90 % without special treatment of 
the sensor surface with substances preventing the binding of non-target VLPs [608]. 
Moreover, the ability of the PAMONO sensor to work with biological samples containing 
serum was also demonstrated [608]. 

(a) The real setup of the PAMONO sensor corresponding to the schematic setup shown in
Figure 2.3. The case of the instrument is approximately the size of a suitcase.

(b) The flow cell with mounted gold-coated glass plate attached to a prism base. The
tubing system serves to guide a liquid sample into and out of the flow cell.

Fig. 2.4: Photos of the PAMONO sensor (a) as a whole and of the flow cell (b) as the heart of the
device individually.


Further work [609] demonstrated the suitability of the PAMONO sensor for the detection of
another type of bio-NPs: extracellular vesicles. The authors employed cysteine-conjugated
protein A/G for the functionalization of the PAMONO sensor surface. This was done
to allow for the elution of bio-NPs captured on the sensor surface and, thus, enable a
post-PAMONO analysis [609]. Moreover, the PAMONO sensor was capable of supplying
sufficient information for the sizing of the studied polystyrene nanoparticles. Such
information could be extracted from the intensity step signal caused by NP binding.
It is important to mention that the Nanoparticle Tracking Analysis (NTA) instrument
Malvern Panalytical NanoSight LM10¹ was used as a reference method in the studies
performed with the PAMONO sensor. Thus, it was also necessary to verify the accuracy
of the LM10 instrument before its use in the studies. This work was performed by Usfoor
and colleagues [696]. It was demonstrated that NP size measurements performed by
the LM10 device are quite accurate, but concentrations were not determined precisely.
Moreover, the NTA analysis of bio-NPs requires labeling of the target particles,
while the PAMONO sensor provides results with a label-free approach. The
drawbacks and advantages of the PAMONO sensor and other SPR-based platforms
for the sizing, quantification, and biochemical analysis of extracellular vesicles
are discussed in detail in a review [606]. One of the advantages of the PAMONO sensor is the
possibility of NP quantification without prior calibration, as shown by Kuzmichev
and colleagues [361].

The analysis of sensor data requires the use of robust detection algorithms that can 
adapt to data variations in real use cases. Approaches that incorporate this criterion 
are presented in the following sections. 


1 Malvern Panalytical NanoSight LM10, https://www.malvernpanalytical.com/en/products/product-range/nanosight-range/nanosight-lm10, accessed 31 March 2022.


2.2.4 Automated Nanoparticle Detection


Based on the challenges of nanoparticle detection, a signal model was developed to 
represent the individual components of the total signal 


$$I(x, y, t) = B(x, y) \cdot A(x, y, t) \cdot P(x, y, t) + R(x, y, t) \tag{2.1}$$


which consists of the background signal B, which is constant within a recorded im- 
age, an artifact signal A, the signal of interest P, and a residuum R, which includes 
random noise [614]. Figure 2.5 shows an example of an unprocessed recording and 
the corresponding image in which the contained particle signal is made visible in a 
preprocessing step by removing the background B and reducing the noise R. This is 
achieved using a sliding-window approach that averages signal values, amplifies pixel 
intensities increasing over time, and weakens constant or falling intensities. Subse- 
quently, a dynamic contrast enhancement is applied in some approaches to further 
emphasize particle signals [733]. After that, the regions of interest can be spotted as 
elliptical areas that are brighter than their environments. Based on the presented sig- 
nal model, the goal of the automated detection is to make the pixel values of signals 
of interest P more distinguishable from their surroundings. Typically this is done by 
employing approaches that attempt to highlight particles directly and, in some cases, 
by weakening artifact signals beforehand. 
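
The effect of such a sliding-window preprocessing on a single pixel can be sketched as follows. This is a minimal illustration under stated assumptions, not the published implementation: the function name, the window size, and the rule of clamping non-rising values to zero are hypothetical choices made here to mirror the description above.

```python
def sliding_window_enhance(series, window=3):
    """Return, per time step, the rise of the windowed mean intensity.

    series: intensities of one pixel over time.
    window: number of frames averaged before and after each time step
            (hypothetical parameter, not taken from the chapter).
    """
    enhanced = []
    for t in range(len(series)):
        before = series[max(0, t - window):t]
        after = series[t:t + window]
        if not before or not after:
            enhanced.append(0.0)  # no temporal context at the border
            continue
        rise = sum(after) / len(after) - sum(before) / len(before)
        # Rising intensities are kept, constant or falling ones are suppressed.
        enhanced.append(max(0.0, rise))
    return enhanced


# Constant background of 100 with a step of +5 starting at frame 4.
pixel = [100, 100, 100, 100, 105, 105, 105, 105]
print(sliding_window_enhance(pixel))
```

Because the constant background cancels in the difference of the two window means, only the frames around the intensity step produce a response, with the maximum at the frame where the step occurs.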

From the algorithmic point of view, the detection of nanoparticle signals on pre- 
processed images can be seen as a combined task consisting of blob detection [639] 
and time series analysis as particles can only be reliably recognized when their spatial 
features like region size and shape as well as their temporal behavior are taken into 
account. The characteristic temporal behavior, which can be described as a step-like 
curve, is shown in Figure 2.6 for two images from sets with different particle sizes.
The origin of the temporal characteristics lies in the way a particle of interest interacts 
with the antibody layer. When such a particle attaches to it, it remains in contact for a 
prolonged time. It is assumed that, in the time of one recording, particles of interest 
stay attached permanently while other particles only cause short-time peaks in the 
measured intensities as their physical shape does not match the specific antibodies 
applied to the gold film (see Section 2.2.3). 
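
The distinction between a permanently attached particle and a short-time peak can be illustrated with a simple check on the intensity series of one position. The following is only a sketch under assumptions: the function name and the parameters `k` and `min_rise` are hypothetical, and the detection methods presented later are considerably more elaborate.

```python
def is_persistent_step(series, k=3, min_rise=2.0):
    """True if the intensity is still elevated at the end of the recording.

    k: number of frames averaged at the start and at the end (hypothetical).
    min_rise: required difference between both averages (hypothetical).
    """
    if len(series) < 2 * k:
        raise ValueError("series too short for the chosen k")
    baseline = sum(series[:k]) / k
    final = sum(series[-k:]) / k
    return final - baseline >= min_rise


bound = [0, 0, 0, 0, 5, 5, 5, 5, 5, 5]      # particle attaches and stays
transient = [0, 0, 0, 0, 5, 5, 0, 0, 0, 0]  # brief contact, then release
print(is_persistent_step(bound), is_persistent_step(transient))
```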

(a) Unprocessed (b) Preprocessed

Fig. 2.5: Comparison of (a) an unprocessed, recorded image and (b) the same image after preprocess-
ing, which enables detecting particles as ellipsoids with high pixel intensities compared with the
background. The particle that can be seen in the preprocessed image can hardly be detected in the
unprocessed image data. To improve the visibility of particles, the preprocessing has to make use of
temporal information from a time window around the current frame.

(a) Intensity of an 80 nm particle over time. (b) Intensity of a 200 nm particle over time.

Fig. 2.6: Exemplary comparison of two signals over time belonging to different particle sizes.

A closer look at intensity curves belonging to particles of different sizes reveals
that a larger particle causes a higher intensity difference

$$I(x, y, t_j) - I(x, y, t_i), \quad t_i < t_j$$

for a position $(x, y)$ and times $t_i$, $t_j$ than a smaller particle. With smaller particle sizes,
the particle signal P becomes weaker, while inaccurate adjustments of the optical 
instruments, less clean samples, and increasing external influence imposed on the 
device increase the intensities of the artifact signal A and the residuum R. This results 
in a lower signal-to-noise ratio (SNR) [127] 


$$\mathrm{SNR}(P, Q) = \frac{|\mu(P) - \mu(Q)|}{\sigma(P)} \tag{2.2}$$

that is determined based on the particle signals $P$ and the non-particle signals $Q := A \cup R$,
where $\mu$ is the average value and $\sigma$ is the standard deviation of a set of intensity values.
While not suitable for analyzing structured artifacts, the measurement can illustrate the 
visibility of particles in an image that predominantly contains random noise besides 
the particle signals. A low SNR value indicates a bad detectability of particles caused 
by a low intensity of particle signals P or a high intensity of random noise R. Specifying 
a metric indicating the strength of the influence of structured artifacts would require
a more complex definition that is less interpretable. For this reason, the SNR value is
used when a directly understandable indication of particle visibility is desired.

Fig. 2.7: An abstract processing pipeline for PAMONO images showing different methods that can be
used for preprocessing, pattern detection, candidate feature extraction, and classification [385].
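
A possible computation of the SNR value of Equation 2.2 is sketched below. It assumes that the SNR is the absolute difference of the mean intensities of particle and non-particle pixels divided by the standard deviation of the particle intensities, and it uses the population standard deviation; both the function name and the sample values are hypothetical.

```python
from statistics import mean, pstdev


def snr(particle_values, other_values):
    """|mean(P) - mean(Q)| / std(P) for two sets of pixel intensities."""
    spread = pstdev(particle_values)  # population standard deviation (assumption)
    if spread == 0:
        raise ValueError("standard deviation of P is zero")
    return abs(mean(particle_values) - mean(other_values)) / spread


# Hypothetical intensity samples of particle pixels and remaining pixels.
particle = [10, 12, 14]
remainder = [9, 10, 11]
print(round(snr(particle, remainder), 3))
```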


2.2.4.1 Stages of Detection 

With respect to the characteristics described in Section 2.2.4, different methods exploit- 
ing spatial and temporal features for nanoparticle detection were evaluated. Figure 2.7 
shows the stages of an abstract detection pipeline containing alternative approaches 
for the different stages. In general, the task of automated nanoparticle detection can be 
divided into different subtasks. The first step of any presented detection approach is 
preprocessing with the goal of image restoration based on a signal model. In addition, 
image enhancement techniques are used to improve the distinguishability between 
particles and other signals. With the preprocessed images at hand, a segmentation of 
the image areas and the detection of candidate regions takes place, which is summa- 
rized as the pattern detector. Then a feature extraction method determines features that 
are used in a pattern classifier to check the properties of candidates so that non-particle 
candidates can be filtered out. 

The whole pipeline can be optimized for a specific criterion using the offline pa- 
rameter optimization described in Section 2.2.5.1. 

For preprocessing images, a sliding-window method with a fixed window size in
different variations [399, 733], as well as a constant background removal method [613],
have been shown to be effective in different approaches. Depending on the downstream
algorithm for detection, additional noise reduction techniques such as Gaussian or 
median filtering [613], wavelet denoising [401], brightness correction [399], or dynamic 
contrast enhancement [733] are used to provide a better separability between particles 
and other signals. 

While a variety of approaches is applicable for the subsequent tasks, they offer
different strengths and weaknesses depending on the data characteristics. The
following are available operations for the subtasks of particle detection. An appropriate
selection of operations is presented in Section 2.2.5.1.

Fig. 2.8: Example of a classic pipeline for the detection of nanoparticles based on template matching
and time-series analysis for the generation of particle candidates and the evaluation of polygon
shape features in a random forest approach for classification (adapted from [399]).

For the task of pattern detection, (fuzzy) template matching in different vari- 
ants [399, 400, 401, 613] and convolutional network approaches [387, 733, 746] were 
developed. With particle candidate regions generated by the pattern detector, feature 
extractors obtain spatial or spatiotemporal features, which are then evaluated by a 
pattern classifier. Extractors generate, for example, polygon shape features like the 
covered area or the circularity of a polygon [399] or measures describing representations 
in other spaces like the Fourier or wavelet space [746]. 
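
Two of the polygon shape features mentioned above, the covered area and the circularity, can be computed as in the following sketch. The chapter does not define circularity; the common definition 4πA/p² with area A and perimeter p is assumed here, and all function names are hypothetical.

```python
import math


def polygon_area(points):
    """Shoelace formula for a simple polygon given as (x, y) vertices."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(points, points[1:] + points[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def polygon_perimeter(points):
    """Sum of the edge lengths, including the closing edge."""
    return sum(math.dist(p, q) for p, q in zip(points, points[1:] + points[:1]))


def circularity(points):
    """4*pi*area / perimeter**2: 1.0 for a circle, smaller for other shapes."""
    return 4.0 * math.pi * polygon_area(points) / polygon_perimeter(points) ** 2


square = [(0, 0), (2, 0), (2, 2), (0, 2)]
print(polygon_area(square), round(circularity(square), 4))
```

Elongated or ragged candidate regions yield a lower circularity than compact elliptical ones, which is what makes such a feature useful for a downstream classifier.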

Evaluated classifiers are, for example, k-nearest-neighbor [615], support vector 
machines (SVM) [615], random forests [387], and convolutional neural networks [387, 
733]. 

In the end, regions in consecutive frames are connected to single particle traces.
This is particularly important when the detected particles are to be counted
instead of just determining whether particles are present at all.
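
How regions of consecutive frames might be connected into countable traces is illustrated by the following greedy sketch. It is not the linking method of the cited works: the nearest-center rule, the function name, and the gating radius `max_distance` are assumptions made for illustration.

```python
import math


def link_traces(detections_per_frame, max_distance=2.0):
    """Greedily connect region centers of consecutive frames into traces.

    detections_per_frame: one list of (x, y) region centers per frame.
    max_distance: hypothetical gating radius in pixels.
    Returns a list of traces, each a list of (frame_index, (x, y)).
    """
    traces = []
    open_traces = []  # traces that were extended in the previous frame
    for frame_index, centers in enumerate(detections_per_frame):
        extended = []
        unmatched = list(centers)
        for trace in open_traces:
            last_center = trace[-1][1]
            best = min(unmatched, key=lambda c: math.dist(c, last_center),
                       default=None)
            if best is not None and math.dist(best, last_center) <= max_distance:
                trace.append((frame_index, best))
                unmatched.remove(best)
                extended.append(trace)
        for center in unmatched:  # every unmatched region starts a new trace
            trace = [(frame_index, center)]
            traces.append(trace)
            extended.append(trace)
        open_traces = extended
    return traces


frames = [[(10, 10)], [(10.5, 10)], [(11, 10), (40, 40)], [(40, 41)]]
print(len(link_traces(frames)))  # number of counted particles
```

Counting traces instead of regions avoids counting the same particle once per frame in which it is visible.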

A concrete example of a classic detection pipeline is shown in Figure 2.8. It utilizes 
template detection and fuzzy time series analysis [399] for pattern recognition and 
classifies particle candidate patches with random forests based on polygon shape 
features. 

Template matching, which has proven effective in various examples, uses a pre- 
viously recorded particle region as the predefined template patch T to detect similar 
regions of the same size in the current image I. This is done by calculating a normalized 
cross-correlation 


$$R(x, y) = \frac{\sum_{x', y'} T(x', y') \cdot I(x + x', y + y')}{\sqrt{\sum_{x', y'} T(x', y')^2 \cdot \sum_{x', y'} I(x + x', y + y')^2}}$$


for each pixel position (x,y) in the image with the template patch [100, 399]. In simplified 
terms, the template patch is moved over the image while the correlation of the template 
patch with the underlying image area is determined at each position. 
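
The sliding evaluation described above can be written down directly. The sketch below assumes that the cross-correlation is normalized by the square root of the product of the summed squared template values and the summed squared image values under the template; it operates on plain nested lists for readability, and the function name is hypothetical.

```python
import math


def normalized_cross_correlation(image, template):
    """Return R[y][x] for every position where the template fits entirely."""
    th, tw = len(template), len(template[0])
    ih, iw = len(image), len(image[0])
    t_energy = sum(v * v for row in template for v in row)
    result = []
    for y in range(ih - th + 1):
        row_scores = []
        for x in range(iw - tw + 1):
            cross = 0.0
            i_energy = 0.0
            for dy in range(th):
                for dx in range(tw):
                    pixel = image[y + dy][x + dx]
                    cross += template[dy][dx] * pixel
                    i_energy += pixel * pixel
            denom = math.sqrt(t_energy * i_energy)
            row_scores.append(cross / denom if denom > 0 else 0.0)
        result.append(row_scores)
    return result


template = [[1, 2], [2, 1]]
# The lower right 2x2 block is the template scaled by a factor of 3.
image = [[1, 0, 0], [0, 3, 6], [0, 6, 3]]
scores = normalized_cross_correlation(image, template)
print(scores[1][1])
```

Due to the normalization, a region that matches the template up to a brightness factor still receives the maximum score, so the response depends on the pattern rather than on the absolute intensity.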

Although using classic methods like template matching with optimally adjusted
parameters can lead to high detection accuracies, slightly outperforming some neural
network approaches [385], it has to be noted that this is only possible if the parameters
are individually optimized for each change in the setup and, in the worst case, for
each dataset [385, 613]. This is a disadvantage for use as a rapid test on-site since
changes in the characteristics of artifacts have to be covered without the need for
manual adjustments for each dataset. This demand leads to the employment of neural
network approaches, which are presented in the following section.

Fig. 2.9: Modification of the classic detection pipeline shown in Figure 2.8 for the use of neural
networks. Template matching, time-series analysis, and the random forest classifier were replaced
by a neural network. These replaced modules are colored green.


2.2.4.2 Spatiotemporal Deep Learning 

The detection of signals of interest from a real-world measurement always comes with 
irregularities. Changes in the environmental influence have to be taken into account, 
as well as the varying cleanliness of the samples. Dealing with these influences re- 
quires a technique that can adapt to the changes while concentrating on the typical 
characteristics of the particles of interest. At this point, the advantages of deep learning- 
based methods can be exploited. Instead of adjusting to a restricted scenario, a deep 
learning-based approach can take advantage of previous recordings by using them to 
approximate patterns that are typical for a particle of interest. Rather than calculat- 
ing the features based on a static method, neural networks choose from a large set of 
possible feature extraction operations limited only by the number of freely learnable 
parameters and their architecture. Several deep learning approaches have been applied 
to the PAMONO recordings to evaluate their detection performance [385, 387, 733, 746]. 


Deep Learning Integration Into the Detection Pipeline There are different ar- 
chitectures of neural networks that can be integrated into the pipeline presented in 
Section 2.2.4.1. One pipeline modification using neural networks created to improve the 
flexibility of particle detection is presented in Figure 2.9. Several modules of the former 
pipeline shown in Figure 2.8 are replaced by neural networks for spatial or temporal 
classification of the respective inputs. 

The key difference between the deep learning approach and the use of more direct
methods like template matching is the way the parameters are determined: they are
adapted to training data instead of being set for a fixed operator. Through the layer-wise
connection of learnable operators, this adaptation takes place on lower levels,
such as edge detection, as well as on higher levels where possible particle shapes
learned from training data can be correlated with a given image. For this purpose,
backpropagation [264] is used in combination with a loss function, which in this case
is the cross-entropy loss evaluated on the predictions of the current training step [387].
The whole process aims at minimizing the evaluated loss function, that is, at achieving
a minimal divergence between predicted and expected classes.
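
For reference, the cross-entropy loss for a single candidate is commonly written as follows. The symbols are notation introduced here and are not taken from [387]: $y$ is the one-hot vector of the expected class, $\hat{y}$ contains the predicted class probabilities, and $C$ is the number of classes.

$$\mathcal{L}(y, \hat{y}) = -\sum_{c=1}^{C} y_c \log \hat{y}_c$$

With the two classes particle and non-particle, the sum reduces to the negative logarithm of the probability assigned to the expected class, so the loss approaches zero as the prediction becomes both confident and correct.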

The presented version with parts of the operations replaced by neural networks was able
to achieve results on data with a median SNR value of 0.7 (see Equation 2.2) for images 
that were mainly affected by random noise. With the classic pipeline, this level was 
only possible on datasets with a median SNR value of 1.25. Although achieving slightly 
worse results compared with a template-matching approach which was optimized for 
each dataset separately, the neural networks can cover different situations without 
having to be adjusted anew [385, 733]. This property is highly desirable when handling 
shifts in the data characteristics, which are described in more detail in Section 2.2.6.2. 


Deep Learning-Focused Pipeline The modified pipeline described in the previous 
paragraph increases the adaptability to changed characteristics of nanoparticle signals 
by replacing some classic operators with neural networks. At the same time, the modi- 
fied structure still relies on inflexible methods for the extraction of candidate regions 
and patch extraction. The next step towards a completely adaptable structure is the 
employment of a neural network that is capable of proposing candidate regions by 
itself. In this way, different sizes of particle regions can be taken into account while 
learning the characteristic features of particles. When evaluating the architecture on 
datasets with different characteristics due to changing optical instruments and differ- 
ent image resolutions, accuracies of over 80 % could be achieved without adjusting 
parameters between analyses [733]. The pipeline which was proposed to achieve this 
goal is illustrated in Figure 2.10. It puts a stronger focus on the use of learned func- 
tionalities. The first step of detection with this approach is the spatial prediction of 
candidate regions in sliding window-preprocessed images. The downstream filtering 
of the candidates takes the temporal changes in pixel intensities into account and 
decides whether it corresponds to a characteristic of interest. Each of the two steps is 
learned using a neural network. The precise implementation of the spatial predictor is 
based on a Mask R-CNN [263], which has already been shown to be capable of handling 
the task of nuclei detection [302, 752, 761]. The Mask R-CNN itself uses a ResNet-50 
Feature Pyramid Network [414] to generate abstract features for downstream detections. 
In terms of the classic pipeline shown in Figure 2.8, this functionality can be called 
a pattern detector. To speed up the training process and improve the generalization 
capability, we employ the concept of transfer learning and use the initial weights of a
ResNet model pretrained on the Microsoft Common Objects in Context [415] dataset.
Since the more universal, low-level patterns used for the detection process are already 
set, the training process can focus on adjusting the weights in the last layers containing 
high-level features. An additional advantage of the used network is that it is not bound 
to one specific tile size as in previous approaches. Rather, it can determine the sizes of 
particle regions from a given set of possible dimensions. This results in higher flexibility 
while retaining the possibility of restricting accepted sizes to a specific range based on 
external knowledge. 

Fig. 2.10: Architecture of the spatiotemporal pipeline for particle detection. It preprocesses the
recorded image stream in two ways, each fitting to its downstream task, predicts particle regions 
for each image, and connects these regions to countable particle traces stretching over multiple, 
consecutive frames [733]. 


The temporal filter network is kept simple, consisting of three fully connected layers 
with intermediate activation functions. It checks if the temporal view of the candidate 
region fits a particle signal and sorts out artifact signals that resemble the signals of 
interest spatially only for a short time. The still hand-crafted, fixed parts of the detection 
process can be found in two places: the preprocessing and the connection of confirmed 
particle regions to countable particle traces. By learning the core functionalities of the 
detection task, this approach was able to reach accuracies of over 80 % with respect 
to the expected particle counts of the test sets, which contained signals of 80 nm to 
200 nm particles with different recording qualities, image sizes, and particle region 
sizes. The sets originate from different development stages of the PAMONO sensor, so 
setups with different optical instruments and camera configurations were included. 
The possibility of taking these differences into account indicates the high flexibility of 
the proposed solution. 
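
The structure of such a temporal filter, three fully connected layers with intermediate activation functions, can be sketched as a plain forward pass. Everything beyond that structure is an assumption made here: the layer sizes, the ReLU activations, the sigmoid output, and the random, untrained weights are hypothetical and do not reproduce the network from [733].

```python
import math
import random


def dense(inputs, weights, biases):
    """Fully connected layer: one output value per weight row."""
    return [sum(w * v for w, v in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def relu(values):
    return [max(0.0, v) for v in values]


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def init_layer(n_in, n_out, rng):
    """Random weights and zero biases (untrained, for illustration only)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def temporal_filter(series, layers):
    """Map an intensity time series to a score between 0 and 1."""
    (w1, b1), (w2, b2), (w3, b3) = layers
    hidden = relu(dense(series, w1, b1))
    hidden = relu(dense(hidden, w2, b2))
    return sigmoid(dense(hidden, w3, b3)[0])


rng = random.Random(0)
layers = [init_layer(10, 8, rng), init_layer(8, 4, rng), init_layer(4, 1, rng)]
print(temporal_filter([0, 0, 0, 0, 5, 5, 5, 5, 5, 5], layers))
```

In a trained network, the weights would be fitted so that step-like series receive a high score and short-time peaks a low one.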

In summary, classical methods such as template matching can work well for particle
detection if they are specifically adapted to the given imaging conditions.
On the other hand, deep learning approaches provide better adaptability to distinct 
conditions with the drawback that they usually require a large amount of training data. 
However, specialized training approaches or the incorporation of domain knowledge 
can reduce the necessary training data. An example method dealing with this challenge 
explicitly is presented in Section 2.2.6.3. 

Fig. 2.11: The SynOpSis approach for automatic parameter adjustment. The optimization process
tries to approximate the optimal parameters for the detection methods. Training is done on previ- 
ously analyzed data. The amount of test data is increased artificially by a synthetic combination of 
background and particle signals of different training datasets. In the end, only the parameters of the 
detection method are changed without increasing the calculation complexity at test time [613]. 


2.2.5 Optimization 


Most of the methods that can be used for object detection and other image-processing 
algorithms require suitable parameter values to be identified and set for improved detec- 
tion. When trying to determine the best parameter settings for the detection of particles 
for a given algorithm or even for a complete pipeline as described in Section 2.2.4, the 
limit of what can be accomplished manually is reached quickly. A method utilizing 
a genetic optimization approach to handle this search automatically is presented in 
Section 2.2.5.1. Section 2.2.5.2 focuses on the possibilities of specialized optimization 
targeting an energy-efficient execution. 


2.2.5.1 Algorithmic Optimization 

A method that was developed for the purpose of automatic parameter selection is 
the SynOpSis (synthesis/ optimization/ analysis) approach [613], which is illustrated 
schematically in Figure 2.11. It makes use of previous examples and synthetic augmen- 
tations based on new combinations of particle signals and background signals from 
different datasets to artificially increase the number of available images for training
and testing. With a given interval of plausible values for the underlying parameters, it is 
run repeatedly with different parameter choices and with the goal of approximating the 
ideal settings for the given data. In particular, the Non-dominated Sorting Genetic Al- 
gorithm II (NSGA-II) [163], a multi-objective optimization approach, is exploited to find 
Pareto-optimal settings for the parameters used in the detection algorithms [613]. After 
that, the parameters that lead to the best results based on a predefined criterion are 
chosen for the single stages of the pipeline. While the optimization itself has to execute 
the detection pipeline after each optimization step, the actual detection is not slowed 
down. The reason for that is that after the optimization is done, the parameters are just 
transferred to the corresponding algorithms without conducting additional calculations 
while the actual detection takes place. Despite the benefits of tuning the algorithms 
to some training data, the optimization is limited to the given feature extractors and 
the parameter sets that are presented to it. A problem with this can appear when the 
selected operators fit a specific situation instead of generalizing, as can happen with 
template-matching approaches [385, 613]. 
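
The notion of Pareto-optimal settings can be made concrete with a small sketch that keeps only non-dominated parameter settings. This is merely the dominance test that NSGA-II builds on, not the genetic algorithm itself (there is no selection, crossover, or crowding distance here). The two objectives, the setting names, and all numbers are hypothetical and chosen so that both objectives are minimized.

```python
def dominates(a, b):
    """a dominates b if it is no worse in every objective and better in one.

    All objectives are assumed to be minimized.
    """
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def pareto_front(candidates):
    """candidates: dict mapping a parameter setting to its objective tuple."""
    return {
        name: objectives
        for name, objectives in candidates.items()
        if not any(dominates(other, objectives)
                   for other_name, other in candidates.items()
                   if other_name != name)
    }


# Hypothetical settings scored by (miss rate, false-positive rate).
settings = {
    "threshold=0.3": (0.05, 0.30),
    "threshold=0.5": (0.10, 0.10),
    "threshold=0.7": (0.30, 0.05),
    "threshold=0.9": (0.35, 0.12),
}
print(sorted(pareto_front(settings)))
```

In this example, the last setting is discarded because another setting is better in both objectives, while the remaining three represent different trade-offs from which one is chosen by the predefined criterion.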


2.2.5.2 Resource Optimization 

The focus of resource optimization for the task at hand can vary depending on the
concrete application scenario, and several goals can conflict with each other.
Depending on the place of use, the energy efficiency of the computing
platform can be highly important, as a reliable external power supply cannot be
assumed everywhere. An energy-efficient device can thus cover a wider range of operating
locations and provide better portability in general: with lower energy consumption,
smaller computing platforms like embedded systems can be operated on battery power
for some time. Work on this subject demonstrated a concept
for balancing a reduction of the overall energy consumption, which results in a
longer battery lifetime, against a shorter execution time [403]. The computations were
either executed locally on a mobile graphics processing unit (GPU) or offloaded over 
a wireless network to transfer the sensor images as well as computation requests to 
a server. A complete offloading of computations was compared with an alternative 
in which the method can select the calculations that are offloaded based on a given 
objective such as energy-saving or a low execution time. For this purpose, a calculation 
of the consumed energy is required. This is achieved using the energy model 


$$E_{\mathrm{total}} = p_{\mathrm{GPU}} \cdot t_{\mathrm{GPU}} + p_{\mathrm{CPU}} \cdot t_{\mathrm{CPU}} + p_{\mathrm{LTE}} \cdot t_{\mathrm{LTE}}$$


which is based on the average powers $p_{\mathrm{GPU}}$, $p_{\mathrm{CPU}}$, and $p_{\mathrm{LTE}}$ of GPU, CPU, and the
communication and the corresponding times $t_{\mathrm{GPU}}$, $t_{\mathrm{CPU}}$, and $t_{\mathrm{LTE}}$ [403]. The result can
be used to approximate the time that an algorithm can run with one charge. Although
LTE was chosen as the communication technique, the energy model can be transferred
to other communication standards. To compute the overall runtime


$$t_{\mathrm{total}} = t_{\mathrm{GPU}} + t_{\mathrm{CPU}} + t_{\mathrm{LTE}} + t_{\mathrm{server}} - t_{\mathrm{parallel}}$$


the time $t_{\mathrm{server}}$ needed by the server and the time $t_{\mathrm{parallel}}$, which can be saved by parallel
operations on the client and the server, are also taken into account.
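The two formulas can be evaluated directly, as the following minimal sketch shows. The class name `Workload`, the function names, and all numbers are hypothetical. The estimate of the operating time per battery charge assumes that the same workload is simply repeated until the battery is empty, which is a simplification made here and not a statement from [403].

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Average powers in watts and durations in seconds for one run."""
    p_gpu: float
    t_gpu: float
    p_cpu: float
    t_cpu: float
    p_lte: float
    t_lte: float
    t_server: float = 0.0    # time needed by the server
    t_parallel: float = 0.0  # time saved by client/server parallelism


def total_energy(w: Workload) -> float:
    """Client-side energy in joules: sum of power times time per component."""
    return w.p_gpu * w.t_gpu + w.p_cpu * w.t_cpu + w.p_lte * w.t_lte


def total_time(w: Workload) -> float:
    """Overall runtime in seconds, including server time and parallel savings."""
    return w.t_gpu + w.t_cpu + w.t_lte + w.t_server - w.t_parallel


def runtime_per_charge(battery_joules: float, w: Workload) -> float:
    """Assumed estimate: seconds of operation if the workload is repeated
    until the battery capacity is used up."""
    runs = battery_joules / total_energy(w)
    return runs * total_time(w)


# Hypothetical comparison of local execution and offloading.
local = Workload(p_gpu=5.0, t_gpu=2.0, p_cpu=2.0, t_cpu=1.0, p_lte=0.0, t_lte=0.0)
offloaded = Workload(p_gpu=0.0, t_gpu=0.0, p_cpu=2.0, t_cpu=0.5,
                     p_lte=1.5, t_lte=2.0, t_server=1.0, t_parallel=0.5)
```

In this invented example, both variants take the same time, but offloading consumes only a third of the client-side energy. An offloading strategy has to weigh exactly this kind of trade-off.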

Simulation software was used to evaluate different hardware setups, such as more
powerful and more energy-efficient GPU alternatives. In these simulations, 90 % of the
energy could be saved compared with the configuration with the fastest calculation.
Conversely, when optimizing for a low overall execution time, a speedup of 55 (that is,
a division of the execution time by 55) could be achieved compared with the most
energy-efficient
configuration [403]. It was also observed that using the most power-efficient GPUs in
the example setups did not lead to the most power-efficient total configuration. Instead, 
a GPU with more computing capacity could calculate results faster, leading to a lower 
total energy consumption [403]. Additionally, it was recognized that the usage of a GPU, 
in general, can reduce energy consumption significantly compared with an execution 
purely on a CPU [402]. 


2.2.6 Application in Real-World Scenarios 


Besides the direct optimization of the detection methods to a specific data basis, there 
are further factors to consider in a real-world scenario. 


2.2.6.1 Identification and Influence of Particle Size Distributions 

While the most important task of the PAMONO sensor is to decide whether particles 
of interest are present in a sample, the determination of physical particle sizes brings 
additional advantages. A size distribution gives more information on the detected par- 
ticles and enables plausibility checks helping to uncover outliers caused by impurities 
in the sample. Domain knowledge can be used for this purpose: since the device is 
adjusted beforehand to a specific particle type corresponding to the applied antibody 
coating, the expected range of particle sizes is known. 

In tests with samples containing different, well-defined particle sizes, it could be 
demonstrated that the PAMONO sensor can be used to distinguish the different sizes 
from each other. The quality of the predicted size distributions was also compared with 
a commercial device showing that the predictions based on the PAMONO sensor rank 
at the same level of size prediction quality. For this purpose, the difference between 
the median signal intensity before and after a particle attaches is calculated. When 
comparing two particle sizes, this difference is proportional to the difference in signal 
intensities. An example of this is shown in Figure 2.6. A visualization of the analyses 
with the PAMONO sensor and, for comparison, a commercial nanoparticle tracking 
device for samples containing 100 nm, 200 nm, and 300 nm particles, as well as one 
mixture containing all three sizes, can be seen in Figure 2.12.

Fig. 2.12: Determined physical particle sizes for four different suspensions analyzed with (from left
to right): 100 nm particles, 200 nm, 300 nm, and a composition of 100 nm and 200 nm and 300 nm 
particles. The top row shows reference distributions obtained with the commercial Malvern LM10 
device while the bottom row shows the results of the PAMONO sensor [609]. 


The soft accuracy value of detected particle size distributions can be evaluated with 


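$$a_r = \frac{\left|\{\, e \in \mathcal{E} : |c^{\mathrm{pred}}(e) - c^{\mathrm{corr}}(e)| \le r \,\}\right|}{|\mathcal{E}|}$$


calculating the share of predicted sizes $c^{\mathrm{pred}}$ that approximately match the correct size
class $c^{\mathrm{corr}}$ for each patch $e \in \mathcal{E}$ of an analyzed image.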

With the chosen division into classes of 10 nm, the classification of a 100 nm particle 
into say, the size class 80 nm, is tolerated with r = 2. At the same time, a prediction of 
130 nm would be considered a false value for a 100 nm particle as a class interval of 
10 nm together with r = 2 results in a tolerance of 20 nm. In an evaluation, particles of 
the sizes 80 nm, 100 nm, and 200 nm were analyzed. With classes of 10 nm intervals, 
an accuracy of over 70 % was reached for r = 2 [386]. 
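The tolerance rule can be illustrated with a short sketch. It assumes that a size is assigned to its class by integer division by the class width. This binning rule, the function names, and the example values are assumptions made for illustration.

```python
from typing import Sequence


def size_class(size_nm: float, class_width_nm: float = 10.0) -> int:
    """Index of the size class (assumed binning: floor division by the width)."""
    return int(size_nm // class_width_nm)


def soft_accuracy(predicted_nm: Sequence[float], correct_nm: Sequence[float],
                  r: int = 2, class_width_nm: float = 10.0) -> float:
    """Share of patches whose predicted class is at most r classes
    away from the correct class."""
    if len(predicted_nm) != len(correct_nm) or len(predicted_nm) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    hits = sum(
        1
        for pred, corr in zip(predicted_nm, correct_nm)
        if abs(size_class(pred, class_width_nm) - size_class(corr, class_width_nm)) <= r
    )
    return hits / len(predicted_nm)


# Four hypothetical patches that all contain a 100 nm particle:
# 80 nm, 100 nm, and 110 nm are tolerated for r = 2, while 130 nm is not.
score = soft_accuracy([80, 130, 100, 110], [100, 100, 100, 100])
```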

In general, the PAMONO sensor can be calibrated by measuring the signal intensi- 
ties of standardized particle sizes to map the measured values of temporal intensity 
steps to physical size. When changes are applied to the optical components of the 
sensor setup, the procedure has to be executed again as they can lead to a general shift 
in the measured signal intensities. 


2.2.6.2 Detection of Viral Infections 

The PAMONO sensor is designed to be usable as a rapid test for detecting viral infections 
on-site. When considered for use at an airport, a city center, or the entrance of a stadium, 
however, additional requirements emerge. For use as a rapid test on-site, the device 
has to be transportable to different places, which is fulfilled by the sensor case being 
suitcase-sized. Energy efficiency can be important for enabling the execution on an 
embedded device and use at otherwise poorly accessible places. A possible solution
for miniaturization was evaluated using the Hardkernel Odroid-XU3 and Odroid-XU4
single-board computers as a basis for calculations [399]. Although execution times
increase, performing calculations and managing computation offloading on embedded
devices was demonstrated to be feasible [399]. The related method of computation
offloading is described in Section 2.2.5.2. 


2.2.6.3 Dealing With Imaging Artifacts 

The amplification of imaging artifacts has to be expected when targeting a real-world 
application of the PAMONO sensor. Therefore, an essential aspect of reliable detection 
is the robustness of algorithms against imaging artifacts. 

These artifacts originate from different sources. On the hardware side, there is 
the imperfection of the optical instruments and their adjustments. The smoothness of 
the gold film and its coating also influence the quality of the signals of interest. For 
example, scratches and other irregularities on the surface cause visible artifacts in 
the recordings. Another source that has to be considered is the limited and varying 
cleanliness of samples in real use cases. For example, air bubbles and dust particles in 
saliva or sputum samples cannot be avoided completely. In general, changing external 
influences have to be expected when targeting an application at places where the 
constancy of laboratory conditions cannot be achieved. 

With variations of artifact types, such as line-like or wave-like structures, regions 
of constant intensities, pulsating regional patterns, or random noises, a detection has 
to tolerate both unstructured and structured artifacts. 

This combination causes classic noise reduction methods to yield insufficient 
results. At the same time, changing patterns and intensities of artifacts also pose a 
problem for learning methods since many related approaches require a high amount of 
training data, especially when artifact characteristics differ in the recorded images. 

Recent work tackles both problems by increasing the robustness of an existing 
learning approach relying only on a small amount of labeled training data. 

This is achieved through a generative adversarial network (GAN) [237] that is not 
directly involved in the detection process but learns to simulate real artifact patterns to 
improve the detector before the actual detection takes place. Synthetically generated 
artifacts are used to overlay training images for the detection network. In this way, the 
detector learns how to tolerate the induced artifacts [547]. In addition, the architecture 
of the detection network itself does not need to be changed, so there is no decrease in 
the speed of the detection process. 

The training of the GAN also requires training data, but it relies on
reference images. These images are characterized by the fact that they do not contain
particles of interest but only show imaging artifacts. The advantage of using this type
of recording instead of recordings carrying particle signals is that no physical test
particles are required. This saves time and material costs and eliminates the risk of
reproducing particle signals in the generated artifact images, as particles do not occur
in the reference images. It is therefore ensured that only artifacts are learned.

Having demonstrated that it can produce realistic images with sufficient variation 
even with a small amount of training data, the StyleGAN2-ADA [312] was employed as 
the specific GAN architecture. While this GAN requires fixed-size training images and 
produces fixed-size images, different sensor configurations lead to different sizes of 
recorded images. Therefore, the GAN-based approach composes multiple patches of a 
fixed side length to cover the area of an annotated image for the training of the detector. 
As an illustration of the whole process, Figure 2.13 shows the generation of synthetic 
artifact signals and their usage in the training of the detection network. 
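The composition of fixed-size patches into an artifact layer of arbitrary size can be sketched as follows. The function `generate_patch` is a hypothetical stand-in for sampling from the trained GAN, and the additive overlay with a `strength` factor is an assumption. The actual blending used in [547], including any treatment of patch borders, may differ.

```python
import math
import random
from typing import Callable, List

Image = List[List[float]]


def compose_artifact_layer(height: int, width: int, patch_size: int,
                           generate_patch: Callable[[], Image]) -> Image:
    """Cover a height x width area with square patches, cropping patches
    that extend beyond the right or bottom border."""
    layer = [[0.0] * width for _ in range(height)]
    rows = math.ceil(height / patch_size)
    cols = math.ceil(width / patch_size)
    for r in range(rows):
        for c in range(cols):
            patch = generate_patch()  # one fixed-size synthetic artifact patch
            for y in range(patch_size):
                for x in range(patch_size):
                    yy, xx = r * patch_size + y, c * patch_size + x
                    if yy < height and xx < width:
                        layer[yy][xx] = patch[y][x]
    return layer


def overlay_artifacts(image: Image, layer: Image, strength: float = 1.0) -> Image:
    """Assumed additive overlay of the artifact layer on a training image."""
    return [
        [pixel + strength * artifact for pixel, artifact in zip(img_row, layer_row)]
        for img_row, layer_row in zip(image, layer)
    ]


# Usage with a random stand-in for the generator (not a trained GAN).
rng = random.Random(0)
fake_generator = lambda: [[rng.gauss(0.0, 0.1) for _ in range(4)] for _ in range(4)]
training_image = [[0.5] * 7 for _ in range(5)]
augmented = overlay_artifacts(training_image,
                              compose_artifact_layer(5, 7, 4, fake_generator))
```

Because the overlay only changes the training images, the detection network and its inference speed remain untouched.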

Training a simple segmentation network with downstream object detection demonstrates
the improvement in robustness against artifacts achieved by the presented GAN-based
approach.

Two configurations were evaluated to compare the results with and without overlaid
synthetic artifacts. The first configuration contains one dataset with particles
of interest and only weak and unstructured artifacts. The second one uses the same
dataset but adds synthetic artifacts generated by a GAN trained with reference images.
Both employ the same test datasets, which contain different types and intensities of
artifacts. After training two identical U-Net [544] architectures, each with one of the
configurations, a clear difference becomes visible. While training without
synthetic artifacts yields poor results, overlaying the training images with synthetic
artifacts improved the mean F1 score
by 22 % [547]. Improvements can be seen for all types of artifacts, but they are most
significant for images with structured artifacts of high intensities.

The GAN-based approach shows similarities to the approach presented in Sec- 
tion 2.2.5.1, as both create mixtures of particle signals and other signals from different 
datasets. The difference becomes clear when considering that the GAN-based gener- 
ation does not directly use the limited set of natural images but artificially creates 
an arbitrary number of additional images in a learning process. This procedure can 
significantly increase the variability of the training images. 


2.2.7 Current Research and Outlook 


The techniques around the PAMONO sensor are expected to benefit from further research 
and development in the area of robust object detection. At the same time, it is worth 
evaluating an extension of the applications beyond virus detection, for which the sensor 
also forms a suitable basis. For each of the two perspectives, current and planned 
investigations are described below.

Fig. 2.13: Schematic representation of overlaying training images with generative adversarial net-
work (GAN)-generated artifacts from composite tiles. The PAMONO sensor is used to record samples 
without particles of interest (upper part) and samples including such particles (lower part) for the 
training process. The trained detection model is then used to search for particles in images where 
their presence is unknown. Dashed arrows show the path of images in the evaluation process, while 
solid arrows represent the path of images in the training process. The images in dotted boxes visual- 
ize the single steps by examples. The yellow boxes illustrate the start and end of the pipeline, green 
boxes represent data, and blue boxes mark algorithms [547]. 


2.2.7.1 Improvements of Detection Capabilities 

A method to improve the overall results concerns the physical setup of the sensor. With 
the rotation of the prism and the position of the camera objective influencing the signal 
quality, the appropriate adjustments must be realized before a recording. Although good 
results can be achieved by manual adjustments, there are limits to human accuracy 
when tuning the optical parts. For this reason, a sensor-actuator control system is 
targeted to determine an optimal setting automatically and to adjust the components 
in feedback with the resulting image signals. 

In any case, further development in the direction of an on-site application should 
consider the need for high robustness of the methods against external disturbances 
to ensure a reliable and widely available rapid test. While the synthetic generation of 
artifacts presented in Section 2.2.6.3 showed improvements in the detection robustness, 
temporal dynamics are not specifically taken into account. By including temporal 
information, further improvements in the robustness can be assumed. 

A different way of improving results is the incorporation of domain knowledge. The
knowledge regarding a specific pathogen suggests the inclusion of particle counts per
image area as additional information. When a reference value for a usual concentration
of those particles is known, the separation between samples with and without particles
of interest can be improved based on an infection-specific concentration threshold.
Although this option has not yet been quantitatively evaluated, it promises to improve
reliability in practical applications.

Fig. 2.14: Examples of different artifact characteristics in preprocessed images after the application
of dynamic contrast enhancement.

A promising area of future exploration of mobile concepts for the execution of 
detection methods is the evaluation of embedded and other mobile devices. For ex- 
ample, the use of Field-Programmable Gate Arrays (FPGA), which are presented in 
Section 6.1 in Volume 1, promises to reduce power consumption while keeping the 
system size small. Another interesting work, which addresses hardware improvements, 
is the evaluation of learning approaches on modern memories in Section 7.2 in Volume 
1. It focuses on the aspect of energy-saving in different approaches by targeting an 
optimal use of memory technologies. 


2.2.7.2 Perspective Applications of the PAMONO Sensor 

The PAMONO sensor proved its power in the characterization of viruses and Virus-Like 
Particles (VLPs). The latter is especially important since VLPs are often used as medical 
products and serve as a basis for different vaccines. Under these circumstances, the 
PAMONO sensor can be applied as an analytical instrument for quality controls during 
routine production of VLP-based vaccines, estimating the size and concentration of
produced NPs in real time. Further, the PAMONO sensor represents a valuable analytical
instrument for the characterization of extracellular vesicles (EVs). In theory, the
PAMONO sensor does not simply enable the sizing and quantification of EVs; it also helps
to gain information about their surface markers and the molecules transported inside
EVs. EVs are attracting growing interest as a means of intercellular communication.
Acting this way, EVs can potentially serve as drug-delivering vesicles and as biomarkers 
of the cellular status. In this case, the ability of the PAMONO sensor to characterize EVs 
without labeling is a promising opportunity. Moreover, the PAMONO sensor may serve 
as a platform for the development of cell-based assays. In this case, the status of cells 
cultured on the sensor surface can be estimated via the cellular production of EVs and 
soluble mediators measured by the PAMONO sensor.


Abstract: The past decade has seen unprecedented progress in the survival chances of
cancer patients as a consequence of new treatments targeting tumor-specific cellular 
processes, which have been uncovered by molecular genetic analyses. From a data 
analysis perspective, the main challenge is the high dimensionality and multimodality 
of the genetic data relative to the small sample sizes (numbers of patients). From 
a computational perspective, the analysis of high volumes of data (about 100 GB of 
sequencing data for an individual tumor genome) currently requires high-powered 
computational resources and still remains challenging in the very short time frames 
that are desired to start treatment immediately. 


We discuss two avenues of progress. First, we present methods that are able to extract 
most of the genetic variants from a sequenced tumor genome, but require only 2 % to 5 %
of the computational resources compared with the current state-of-the-art procedures. 


Second, we discuss a versatile unified statistical model for distinguishing true vari- 
ants from technical artifacts of the DNA sequencing process. 


Using analyses of paired samples from primary and relapse neuroblastoma tumors, 
we are able to extract patterns of tumor evolution that are correlated with cancer 
progression and the escape of tumors from therapeutic intervention. 


As a result, a novel risk classification of neuroblastoma has been established based 
on genomic and mutational data. 


2.3.1 Introduction 


Cancer patients nowadays receive precise diagnosis and personalized therapy based 
on their individual molecular genetic data. 

Here, we report on the analysis of DNA data from patients with neuroblastoma, a 
solid tumor typically occurring in children. 

Diagnostics and prognoses are based on DNA sequencing, currently ranging from 
a few hundred targeted genes to entire genomes requiring 100 GB per patient in the 
near future. 

Identifying relevant variants in the DNA that serve as biomarkers to distinguish
between different risk classes, or primary tumors from relapses, or treatable versus
non-treatable tumors, is and remains challenging, but every step of progress in this
field helps to make better treatment decisions in the long run. The following molecular
features derived from genomic data are of primary interest: (1) single nucleotide
mutations (or variants, SNVs), (2) short insertions or deletions of DNA, (3) large structural
variants (e.g., chromosomal translocations), (4) copy number changes (gain or loss
of genetic material in tumor cells), (5) epigenetic changes, such as DNA methylation
changes, and (6) differences in gene expression.

Fig. 2.15: Left: Evolutionary tree derived from binary features (presence/absence of informative
single nucleotide variants) of a primary neuroblastoma tumor and several relapse samples taken from
different tissues and at different times from the same patient [583]. Right: Illustration of evolution of
tumor heterogeneity under therapy over time [588].

Over the past years, we have developed feature extraction workflows and data anal- 
ysis processes for each type of feature mentioned above. For data analysis workflows in 
general, but especially in medicine, reproducibility of derived data from raw data is of 
utmost importance. The basis of each of these processes is our workflow management 
system called Snakemake [345, 450], which is now in widespread use, as it guarantees
reproducibility, in particular for large-scale DNA sequence analysis workflows.
Furthermore, the Bioconda package repository [242] was founded by one of us (JK) and 
now, with widespread community support, acts as a central repository for semantically 
versioned bioinformatics software, which is made available in a reproducible way. 

In the following, for simplicity, we focus on the first type of features (SNVs), but 
these findings also translate to the other variant types, if appropriately adjusted. In 
particular, we discuss whether we can determine genomic variants that distinguish 
primary neuroblastomas from those that re-occur after therapy (referred to as relapses 
or relapse samples). The latter are responsible for adverse disease courses and are 
currently considered to be incurable. It was therefore highly encouraging that we were 
able to identify several genes with recurrent mutations present only in relapse samples 
[583]. Figure 2.15 summarizes some of our key findings on tumor heterogeneity after 
relapse (left side) and illustrates the tumor evolution process (right side). It is mainly 
this developing molecular heterogeneity of tumor cells under treatment that currently 
prohibits effective long-term therapies. 

The main resource constraints for this setting and similar situations are a limited 
number n of samples (patients) versus an extremely high number p of potential features 
(e.g., each potential variant in the genome observed in at least one sample). 

So we face two challenges in particular: 


2.3 Cancer Diagnostics and Therapy from Molecular Data —— 45 


1. resource-efficient detection of variant candidates (Section 2.3.2)
2. accurate classification of candidates in each sample (true variant vs. noise,
technical artifact, etc.; Section 2.3.3)


2.3.2 Resource-Efficient Detection of Variant Candidates 


Standard genetic mutation or variant analysis starts with an extremely compute- 
intensive step: the localization of every single sequenced DNA fragment (or “read”; 
there are literally millions of DNA reads in a single dataset) in the genome, and a 
pairwise comparison between the fragment and the genomic sequence. Such pairwise 
alignments are the basis of variant calling: if many reads show a certain difference
at the same position compared with the reference genome, this provides convincing
evidence that the sequenced genome contains a specific genetic variant at that position,
either in both inherited chromosome copies (homozygous variant) or in just one
(heterozygous variant). To be precise, complex statistical models and tests are necessary
to distinguish true variants from possible technical artifacts (see Section 2.3.3).

This first localization and comparison step is performed by so-called read mappers, 
such as BWA-mem [391], bowtie2 [371], minimap2 [390], or PEANUT [344]. Extensive 
parallelism on both multi-core systems and GPUs keeps the (wall clock) time of this
step within a few hours. However, the overall CPU work amounts to many CPU days or
months for a single dataset, consuming considerable energy.

It is therefore of high interest to develop more resource-frugal methods to achieve 
the same task, or at least a large fraction of it. We explored alignment-free methods 
as an alternative to the above mapping and alignment-based method. In particular, 
we propose to use short DNA strings of length k (so-called k-mers) to directly detect 
potential single-nucleotide variants, as we now describe. 


2.3.2.1 Genome Preprocessing 

We first preprocess the reference genome. 

1. Select an appropriate value for k, such that most k-mers are unique in the reference 
genome. Our studies indicate that 21 < k < 31 works well for the human genome 
[517]. 

2. Build a (very large) hash table of k-mers in the human reference genome and the 
number of times that they occur. We need to take into account that double-stranded 
DNA is equivalent to its reverse complement. 

3. Mark the unique k-mers; they point to a unique position in the genome. 

4. Among the unique k-mers, mark those that are robustly unique against single
substitutions, i.e., those for which no Hamming-distance-1 neighbor also occurs in
the genome. The resulting robustly unique k-mers do not only point to a unique
location in the genome; they also cannot easily be changed into k-mers that occur
at a different genomic location.
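The four preprocessing steps can be sketched on a toy scale as follows. A Python `Counter` stands in for the very large hash table, and a k-mer and its reverse complement are merged by storing only the lexicographically smaller of the two (a common convention that is assumed here). Ignoring a neighbor that is the k-mer's own reverse complement is likewise an assumption of this sketch.

```python
from collections import Counter
from typing import Iterator, Set, Tuple

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer: str) -> str:
    """Smaller of a k-mer and its reverse complement (both strands merged)."""
    return min(kmer, kmer.translate(_COMPLEMENT)[::-1])


def count_kmers(sequence: str, k: int) -> Counter:
    """Count canonical k-mers, skipping windows with non-ACGT characters."""
    counts: Counter = Counter()
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[canonical(kmer)] += 1
    return counts


def hamming_neighbors(kmer: str) -> Iterator[str]:
    """All k-mers that differ from kmer by exactly one substitution."""
    for i, base in enumerate(kmer):
        for alt in "ACGT":
            if alt != base:
                yield kmer[:i] + alt + kmer[i + 1:]


def unique_and_robust(counts: Counter) -> Tuple[Set[str], Set[str]]:
    """Return the unique k-mers and the subset that is robustly unique."""
    unique = {kmer for kmer, n in counts.items() if n == 1}
    robust = {
        kmer for kmer in unique
        if not any(
            canonical(nb) != kmer and canonical(nb) in counts
            for nb in hamming_neighbors(kmer)
        )
    }
    return unique, robust


# Toy "genome" with k = 5; a real application would use k in the range given above.
counts = count_kmers("AAAACAAAAG", 5)
unique, robust = unique_and_robust(counts)
```

In the toy example, AAAAC and AAAAG are unique but not robustly unique, because a single substitution turns one into the other.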


The alignment-free methods work either exclusively on the robustly unique k-mers or 
on all unique k-mers, giving the information from the robustly unique k-mers a higher 
weight. This pre-processing step has to be performed only once for any genome version. 

We found that multi-way bucketed Cuckoo hash tables are ideally suited for the 
task, as they allow relatively quick construction times and yield very fast lookup times 
later. There are smooth trade-off options between lower memory usage and even faster 
lookup times. 
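The principle of a multi-way bucketed Cuckoo hash table can be shown with a compact sketch: each key has a few candidate buckets, each bucket has a few slots, and an insertion into full buckets evicts a resident key, which is then reinserted elsewhere. This is a didactic toy and not the engineered implementation of [756]. The class name, the default parameters, and the use of salted BLAKE2b digests as hash functions are assumptions.

```python
import hashlib
import random
from typing import List, Optional, Tuple


class BucketedCuckooTable:
    """Toy multi-way bucketed cuckoo hash table mapping str keys to int values."""

    def __init__(self, n_buckets: int, bucket_size: int = 4, n_hashes: int = 3,
                 max_walk: int = 500, seed: int = 0) -> None:
        self.n_buckets = n_buckets
        self.bucket_size = bucket_size
        self.n_hashes = n_hashes
        self.max_walk = max_walk
        self.buckets: List[List[Tuple[str, int]]] = [[] for _ in range(n_buckets)]
        self.rng = random.Random(seed)

    def _candidate_buckets(self, key: str) -> List[int]:
        """One bucket index per hash function (salted BLAKE2b digests)."""
        out = []
        for i in range(self.n_hashes):
            digest = hashlib.blake2b(key.encode(), digest_size=8,
                                     salt=bytes([i])).digest()
            out.append(int.from_bytes(digest, "little") % self.n_buckets)
        return out

    def get(self, key: str, default: Optional[int] = None) -> Optional[int]:
        """Lookup inspects at most n_hashes * bucket_size slots."""
        for b in self._candidate_buckets(key):
            for stored_key, value in self.buckets[b]:
                if stored_key == key:
                    return value
        return default

    def insert(self, key: str, value: int) -> None:
        # Overwrite the value if the key is already stored.
        for b in self._candidate_buckets(key):
            bucket = self.buckets[b]
            for slot, (stored_key, _) in enumerate(bucket):
                if stored_key == key:
                    bucket[slot] = (key, value)
                    return
        item = (key, value)
        for _ in range(self.max_walk):
            choices = self._candidate_buckets(item[0])
            for b in choices:
                if len(self.buckets[b]) < self.bucket_size:
                    self.buckets[b].append(item)
                    return
            # All candidate buckets are full: evict a random resident
            # and continue the random walk with the evicted item.
            b = self.rng.choice(choices)
            slot = self.rng.randrange(self.bucket_size)
            self.buckets[b][slot], item = item, self.buckets[b][slot]
        raise RuntimeError("table too full; rebuild with more buckets")


table = BucketedCuckooTable(n_buckets=64)
for i in range(100):
    table.insert(f"kmer{i}", i)
```

More hash functions and larger buckets allow a higher load (less memory) at the price of more slots to inspect per lookup, which is one way to read the trade-off mentioned above.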

In a preliminary study [756], we designed and implemented these hash tables for 
a simpler application than variant calling: xenograft sorting. Here, a human tumor is 
engrafted into another organism (typically a mouse) to be able to study its evolution 
and response to different therapies. When such a tumor is sequenced, one obtains a 
mixture between human and mouse DNA reads, so that all reads have to be assigned to 
the organism of origin before proceeding further. This assignment is called xenograft 
sorting. Even though human and mouse are quite similar on a genetic level, they can be 
sufficiently well distinguished on the k-mer level. We presented a classification method 
based on k-mer hash tables, as outlined above, with extremely high accuracy, but using 
much less CPU work than previous methods: less than 25 % of comparable hash-based 
methods and less than 5 % of classical alignment-based methods [756]. We additionally 
showed that the placement of keys in the hash table can be optimized to yield optimal 
average look-up times (based on the number of random memory accesses, i.e., likely 
cache misses), saving 10 % to 15 % of CPU work for each sample (after a 48 CPU hour 
optimization procedure that has to be run only once) [757]. 
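The idea of assigning reads by their k-mer content can be illustrated with a deliberately simple voting scheme. Python sets replace the hash tables, and the decision rule (majority of species-specific k-mers with a minimum fraction `min_fraction`) is an assumption made for this sketch. The classification rules of [756] are more refined.

```python
from collections import Counter
from typing import Set

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer: str) -> str:
    """Smaller of a k-mer and its reverse complement."""
    return min(kmer, kmer.translate(_COMPLEMENT)[::-1])


def kmer_set(sequence: str, k: int) -> Set[str]:
    """All canonical k-mers of a sequence."""
    return {canonical(sequence[i:i + k]) for i in range(len(sequence) - k + 1)}


def classify_read(read: str, k: int, graft_kmers: Set[str], host_kmers: Set[str],
                  min_fraction: float = 0.5) -> str:
    """Return 'graft', 'host', or 'ambiguous' for one read.
    Only k-mers that occur in exactly one of the two genomes cast a vote."""
    n = len(read) - k + 1
    if n <= 0:
        return "ambiguous"
    votes: Counter = Counter()
    for i in range(n):
        kmer = canonical(read[i:i + k])
        in_graft, in_host = kmer in graft_kmers, kmer in host_kmers
        if in_graft and not in_host:
            votes["graft"] += 1
        elif in_host and not in_graft:
            votes["host"] += 1
    if votes["graft"] > votes["host"] and votes["graft"] >= min_fraction * n:
        return "graft"
    if votes["host"] > votes["graft"] and votes["host"] >= min_fraction * n:
        return "host"
    return "ambiguous"


# Toy genomes with k = 3 (graft: human tumor, host: mouse).
human = kmer_set("AAACCC", 3)
mouse = kmer_set("AAAGAG", 3)
```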


2.3.2.2 Basic Alignment-Free Variant Calling 

The underlying idea of this method is as follows: We count all the k-mers in a sequenced 
sample and produce a histogram of the count values. A typical (unique) k-mer should 
have a copy number of two (in a diploid genome) when no variant is present. We 
therefore analyze the histogram of observed k-mer counts (Figure 2.16) from the sample. 
The leftmost peak (counts near zero) can be explained by rare k-mers due to sequencing 
errors or contamination; we can attempt to correct these, or ignore them entirely. We fit 
a negative-binomial mixture model to the remaining peaks occurring at equidistant 
counts. The main peak corresponds to a copy number of 2 in a diploid genome (from 
k-mers present in both the maternal and paternal chromosome set). 

The initial analysis is restricted to the robustly unique k-mers from the reference
genome. We expect that each such k-mer has a copy number of either 0 (homozygous
variant), 1 (heterozygous variant) or 2 (no variant) in the sample. Higher copy numbers
could be explained by segmental duplications, which we do not consider at this point.
If we suspect a variant, we look for isolated single nucleotide variants, i.e., k-mers with
a Hamming distance of 1 to the reference k-mer, among the k-mers in the sample. If
we find a unique one (with the expected copy number), we store the pair of reference
k-mer and modified k-mer as a candidate for a variant.

Plot title: Strongly unique 25-mers from GRCh38 counted in HG001 reads (GIAB Illumina WGS 2x150bp, 270 GB FASTQ)
Fig. 2.16: Illustration of a k-mer count histogram, relating observed k-mer counts (x-axis) to their
frequency (y-axis; logarithmic). The leftmost peak (close to zero) represents noise and erroneous
k-mers, mostly due to sequencing errors. The main peak (near 80, approximately the sequencing
coverage of this example) corresponds to the standard copy number of 2. The shoulder (near 40)
then corresponds to a copy number of 1 and consists of k-mers that are part of heterozygous isolated
mutations. This histogram was created from a control sample; in a tumor sample, more irregularities,
especially additional peaks at higher copy numbers, can be expected.

This process can be implemented very efficiently, and in addition, it can be trivially 
parallelized. It yields candidates for Single Nucleotide Variants (SNVs) that then can 
be checked by statistical methods (see next section). It can also reliably detect copy 
number variants on long segments. However, it cannot easily detect more complex 
variants, such as two SNVs in close proximity, short indels, or structural variants. Here,
translating k-mer information into an exact variant is more difficult, but one can resort to
alignment-based methods for the local regions around areas with suspicious k-mer
frequency structure.
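The search for isolated SNV candidates described in this section can be sketched as follows. Instead of the fitted mixture model, the sketch estimates copy numbers by rounding the observed count divided by an assumed coverage per chromosome copy (`coverage_per_copy`). It also assumes that the expected copy number of the modified k-mer is two minus that of the reference k-mer. Both are simplifications, and all names and numbers are hypothetical.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer: str) -> str:
    """Smaller of a k-mer and its reverse complement."""
    return min(kmer, kmer.translate(_COMPLEMENT)[::-1])


def hamming_neighbors(kmer: str) -> Iterator[str]:
    """All k-mers that differ from kmer by exactly one substitution."""
    for i, base in enumerate(kmer):
        for alt in "ACGT":
            if alt != base:
                yield kmer[:i] + alt + kmer[i + 1:]


def copy_number(count: int, coverage_per_copy: float) -> int:
    """Crude stand-in for the mixture model: nearest integer copy number."""
    return round(count / coverage_per_copy)


def snv_candidates(robust_ref_kmers: Iterable[str], sample_counts: Dict[str, int],
                   coverage_per_copy: float) -> List[Tuple[str, str]]:
    """Pairs (reference k-mer, modified k-mer) that suggest an isolated SNV.
    sample_counts maps canonical k-mers of the sample to their counts."""
    candidates = []
    for ref in robust_ref_kmers:
        ref_copies = copy_number(sample_counts.get(canonical(ref), 0), coverage_per_copy)
        if ref_copies >= 2:
            continue  # looks like the normal diploid state: no variant suspected
        expected_alt = 2 - ref_copies  # 2 if homozygous, 1 if heterozygous
        hits = [
            nb for nb in hamming_neighbors(ref)
            if canonical(nb) != canonical(ref)
            and copy_number(sample_counts.get(canonical(nb), 0), coverage_per_copy)
            == expected_alt
        ]
        if len(hits) == 1:  # keep only an unambiguous modified k-mer
            candidates.append((ref, hits[0]))
    return candidates
```

Since up to k overlapping reference k-mers cover the same position, one SNV typically produces several candidate pairs, which would have to be merged before the statistical evaluation. Each reference k-mer is processed independently, so the loop can be distributed over many workers without coordination.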


Perspectives Alignment-free variant calling is still an active research area, and while 
we made contributions to the underlying data structures (engineered Cuckoo hash 
tables) and were successful in calling selected SNVs, further ideas are necessary to call 
larger classes of variants reliably. Possible approaches include using locality sensitive 
hashing, in particular min-hashing, instead of exact k-mer hashing, combined with 
hybrid methods between alignment-free and alignment-based approaches. To assess the 
potential of min-hashing-based methods, we conducted a detailed statistical feasibility 
study, examining when it is useful to include known variants into a k-mer-based read 
mapper (and when not; see [515]), paving the way for novel approaches.


2.3.3 A Unified Statistical Model for Genomic Variant vs. Artifact Classification


We present an extension and generalization of a latent variable model originally
published by Köster, Dijkstra, Marschall, and Schönhuth [342].

For this, we consider a set S of samples. Samples can be related with each other in 
three ways. 

1. There can be clonal inheritance between two samples $s_1, s_2 \in S$: sample $s_1$ inherits
all constitutive genetic variants of sample $s_2$. In addition, the tissues of origin of
both samples may have developed their own somatic mutations during their lifetime
until sequencing.

2. There can be Mendelian inheritance [443] between samples $s_1, s_2, s_3 \in S$: the
individual of origin of sample $s_1$ inherits constitutive genetic variants of two parental
individuals ($s_2$ and $s_3$).

3. A sample $s \in S$ can be contaminated with a fraction of another sample $s' \in S$.


We represent the three relationships in a directed graph $G = (S, I_c, I_m, C)$ (the sample
graph) with edge types $I_x \subseteq S \times S$ for clonal ($x = c$) and Mendelian ($x = m$) inheritance
as well as $C \subseteq S \times S$ for contamination. The corresponding contamination fraction can
be obtained with $c : C \to [0, 1]$.

The above representation can be used to model the three classical cases of genomic 
variant calling: single-sample or population calling (the graph has no edges) [171], 
pedigree based family variant calling (Mendelian inheritance edges) [171], and calling 
of tumor/normal sample combinations (clonal inheritance and contamination edges) 
[342]. Importantly though, instead of being limited to these, it can reach beyond them 
by combining the mechanisms into more complex scenarios. 
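
To make the sample graph concrete, the following minimal Python sketch stores the node set and the three edge types and builds the tumor/normal case. The class and method names are illustrative assumptions and do not reflect how Varlociraptor represents the graph internally. Inheritance edges are stored as (parent, child) pairs, so that samples without incoming inheritance edges are the founders.

```python
from dataclasses import dataclass, field


@dataclass
class SampleGraph:
    """Illustrative container for G = (S, I_c, I_m, C); not Varlociraptor's API."""
    samples: set = field(default_factory=set)
    clonal: set = field(default_factory=set)           # I_c as (parent, child) pairs
    mendelian: set = field(default_factory=set)        # I_m as (parent, child) pairs
    contamination: dict = field(default_factory=dict)  # (s, s') -> fraction c(e)

    def add_clonal(self, child, parent):
        self.samples |= {child, parent}
        self.clonal.add((parent, child))

    def add_mendelian(self, child, parent_a, parent_b):
        self.samples |= {child, parent_a, parent_b}
        self.mendelian |= {(parent_a, child), (parent_b, child)}

    def add_contamination(self, sample, by, fraction):
        if not 0.0 <= fraction <= 1.0:
            raise ValueError("contamination fraction must lie in [0, 1]")
        self.samples |= {sample, by}
        self.contamination[(sample, by)] = fraction

    def founders(self):
        """Samples without incoming inheritance edges."""
        children = {child for _, child in self.clonal | self.mendelian}
        return self.samples - children


# Tumor/normal calling: clonal inheritance plus contamination by normal tissue.
tumor_normal = SampleGraph()
tumor_normal.add_clonal("tumor", parent="normal")
tumor_normal.add_contamination("tumor", by="normal", fraction=0.1)

# Family calling: Mendelian inheritance from two parents.
trio = SampleGraph()
trio.add_mendelian("child", "mother", "father")
```

Single-sample or population calling corresponds to a graph in which only `samples` is populated.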


2.3.3.1 Variables and Notation 

Observed Variables For each potential genomic variant of interest, we observe
sequencing read data $Z_s = (Z_1^s, \dots, Z_k^s)$. If the read data consists of so-called paired-end
reads (each investigated DNA fragment is sequenced from both ends), each observation
$Z_i^s \in Z_s$ is a tuple $Z_i^s \in (\{A,C,G,T\}^*, \{A,C,G,T\}^*, \mathbb{N})$, with the first and the second
element denoting the nucleotide sequences of the two reads of the pair and the last element denoting
the so-called observed insert size, that is, the number of bases from the leftmost to
the rightmost covered base when aligning the read pair to the most likely position of
origin on the reference genome of the investigated species. If the read data consists of
so-called single-end reads (each investigated DNA fragment is sequenced just from one
end), each observation $Z_i^s \in Z_s$ is simply the nucleotide sequence of the read, in other
words $Z_i^s \in \{A,C,G,T\}^*$.


Latent Variables The central readout of our model is the allele frequency in each
sample $s$, denoted as latent variable $\theta_s \in [0, 1]$. For each read observation $i$, there is
a binary latent variable $\xi_i^s$ with $\xi_i^s = 1$ denoting that the observation originates from
the variant allele (i.e., from a genome copy hosting the variant under consideration)
and $\xi_i^s = 0$ denoting that the read originates from the reference allele (i.e., from a
genome copy hosting exactly the same sequence as in the reference genome of the
corresponding species). In addition, a binary latent variable $\omega_i^s$, denoting whether the
observation has been aligned to the correct ($\omega_i^s = 1$) location of origin in the reference
genome, is used.


Extensions for Bias Estimation The model can be further extended in order to
estimate biases in additionally observed properties of the read data, that is, the strand, the
read position supporting the variant, the read orientation, and whether the alignment
against the reference genome covers the entire read. For any of these properties, a
deviation from a uniform distribution of the observed values among variant-supporting
reads typically indicates an artifact. For clarity and brevity, we omit the integration of
these biases in our model here. An integration of strand bias can already be found in [342].


2.3.3.2 Latent Variable Model 

In the following, we briefly introduce the latent variable model used for calculating 
allele frequency likelihoods that has been published recently [342], and then provide a 
generalization of the method. When evaluating if a read deviates from the reference 
genome, two types of uncertainty are to be considered. First, there is alignment uncer- 
tainty: often, a read can be aligned at multiple loci in the reference genome (also see 
Section 2.3.2). 

Depending on their similarity, there is more or less certainty about the optimal
positioning of the read. Read mappers and alignment tools, such as BWA [391], report
this uncertainty as mapping quality (MAPQ), which can be translated into a probability
$\pi_i^s$ associated with each read observation $Z_i^s$ to be aligned to the correct locus. Second,
there is typing uncertainty: the observed read sequence is not a perfect representation of 
the true DNA fragment that has been sequenced, but instead a measurement entailing 
potential errors and artifacts. The DNA sequencing machine provides an estimate of 
the certainty of each reported base as the so-called base quality, which can again be 
translated into a probability of the reported base to be correct. In addition, depending 
on the sequencing technology, there are known rates of false insertions or deletions 
of bases in the reported read sequences, as remarked on for example by Schirmer, 
D’Amore, Ijaz, Hall, and Quince [578]. 
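
Both MAPQ and base qualities are conventionally reported on the Phred scale, that is, as $-10 \log_{10}$ of the error probability, so the translation into a probability mentioned above is short. The following Python sketch illustrates it; the function name is an illustrative assumption, and real tools may additionally cap or recalibrate these values.

```python
def prob_correct_from_phred(q: float) -> float:
    """Probability that a Phred-scaled call (MAPQ or base quality) is correct.

    A Phred score q encodes an error probability of 10 ** (-q / 10).
    """
    if q < 0:
        raise ValueError("Phred scores are non-negative")
    return 1.0 - 10.0 ** (-q / 10.0)


# MAPQ 20 corresponds to a 1 % chance of a wrong locus, MAPQ 30 to 0.1 %.
pi_mapq20 = prob_correct_from_phred(20)
pi_mapq30 = prob_correct_from_phred(30)
```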

We now model the relationships between our observed and latent variables, while
taking the above-mentioned uncertainties into account. For each observation $Z_i^s$ in sample
$s$, we handle alignment uncertainty by defining the distribution of the latent variable
$\omega_i^s$ as

$$\omega_i^s \sim \mathrm{Bernoulli}(\pi_i^s). \tag{2.3}$$

The distribution of the latent variable $\xi_i^s$ depends on the expected fraction of
observations from the variant allele. If $s$ is not contaminated by another sample, we define

$$\xi_i^s \sim \mathrm{Bernoulli}(\theta_s \tau). \tag{2.4}$$


Thereby, $\tau \in [0, 1]$ denotes a sampling bias that occurs because it is usually harder to
obtain observations from the variant allele: it is harder to align, and depending on the
size of the variant, harder to obtain reads that sufficiently cover it [342]. If, in contrast,
$s$ is contaminated by a sample $s'$ (i.e., $e = (s, s') \in C$) with fraction $\alpha = c(e)$, we define

$$\xi_i^s \sim \mathrm{Bernoulli}(\alpha \theta_s \tau + (1 - \alpha) \theta_{s'} \tau). \tag{2.5}$$


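In other words, the expected fraction of observations from the variant allele becomes a
mixture of the allele frequencies in $s$ and $s'$.
Then, typing uncertainty can be modeled as

$$Z_i^s \mid \xi_i^s, \omega_i^s \sim \begin{cases} p_i & \text{if } \xi_i^s = 1, \omega_i^s = 1 \\ a_i & \text{if } \xi_i^s = 0, \omega_i^s = 1 \\ o_i & \text{if } \xi_i^s = 0, \omega_i^s = 0. \end{cases} \tag{2.6}$$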


Here, $a_i$, $p_i$, and $o_i$ are probability distributions modeling the case that the observation
comes from a genome copy where the variant is present ($p_i$), absent ($a_i$), or from a
different locus ($o_i$). These can be computed using Pair Hidden Markov models, which
essentially realign the read sequence against the sequence of reference and alternative
allele while statistically considering sequencing error rates, as shown in Köster,
Dijkstra, Marschall, and Schönhuth [342] for deletions and insertions. Since then, via
analogous approaches, our model has been extended to also support all other common
variant types ranging from small (SNV, MNV) to structural variants such as inversions,
duplications, and arbitrary chains of breakpoints.

By combining the above relations, the model can be used to calculate the likelihood
of a given combination of allele frequencies of samples $S = \{s_1, \dots, s_n\}$ as

$$\Pr(Z_{s_1}, \dots, Z_{s_n} \mid \theta_{s_1}, \dots, \theta_{s_n}) = \prod_{j=1}^{n} \prod_{i=1}^{|Z_{s_j}|} \Pr(Z_i^{s_j} \mid \theta_{s_1}, \dots, \theta_{s_n}) \tag{2.7}$$


while assuming independence between the read observations. Note that the computa- 
tion of the likelihood function is linear in the total number of read observations, as we 
have shown previously [342]. 
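
The linear-time claim becomes apparent once the two binary latent variables are summed out per read. The Python sketch below does this for a single, uncontaminated sample. It rests on several assumptions that are not taken from the text: the values `p_i`, `a_i`, and `o_i` are treated as already evaluated at the observed read (in practice they would come from the Pair Hidden Markov models), `o_i` is assumed to apply whenever the read is misplaced regardless of the allele of origin, and all function names are hypothetical.

```python
import math


def read_likelihood(theta, tau, pi_i, p_i, a_i, o_i):
    """Pr(Z_i | theta) with the latent allele and placement indicators summed out.

    theta: allele frequency, tau: sampling bias (Eq. 2.4),
    pi_i: probability of correct placement (Eq. 2.3),
    p_i / a_i / o_i: likelihood of the read given variant allele,
    reference allele, or a different locus (Eq. 2.6).
    """
    variant_frac = theta * tau
    correctly_placed = variant_frac * p_i + (1.0 - variant_frac) * a_i
    return pi_i * correctly_placed + (1.0 - pi_i) * o_i


def log_likelihood(theta, tau, reads):
    """Sum of per-read log-likelihoods: one pass over the reads (Eq. 2.7)."""
    return sum(math.log(read_likelihood(theta, tau, *read)) for read in reads)


# Hypothetical reads as (pi_i, p_i, a_i, o_i): two support the variant, one does not.
reads = [
    (0.99, 0.90, 0.01, 0.001),
    (0.99, 0.80, 0.02, 0.001),
    (0.95, 0.01, 0.90, 0.001),
]
grid = [k / 100 for k in range(101)]
theta_hat = max(grid, key=lambda t: log_likelihood(t, 1.0, reads))
```

Evaluating the likelihood on a grid of allele frequencies, as in the last two lines, costs one pass over the reads per grid point.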


2.3.3.3 Prior Distribution 

The prior probability of a given allele frequency combination $\theta_{s_1}, \dots, \theta_{s_n}$ in our
generalized model can be computed by considering the dependencies between the samples
modeled by the sample graph $G$ (see beginning of Section 2.3.3). In addition, we assume
that for each sample $s \in S$, a ploidy $p_s \in \mathbb{N}$ (which may differ by chromosome, e.g.,
it may be sex-specific), a somatic effective mutation rate $\omega_s \in [0, 1]$, and a germline
mutation rate $\nu_s \in [0, 1]$ are known. For calculating a prior probability, the key is to
explain the total allele frequency $\theta_s$ by a germline allele frequency $\iota_s$ and a somatic
allele frequency $|\theta_s - \iota_s|$. Usually, one of the two will be zero, such that variants are
explained either by germline or by somatic mutation, but combinations thereof can also
happen in rare cases. From the known ploidy $p_s$ of a sample $s \in S$, we can calculate
the set of possible germline allele frequencies $\phi_s \subseteq [0, 1]^{p_s+1}$. For example, for $p_s = 2$, we
obtain $\phi_s = \{0, 0.5, 1\}$; in other words, any germline variant may occur in no allele,
in one allele (0.5 or 50 %), or in two alleles (1.0 or 100 %). The prior probability can then
be calculated by recursively exploring all possible explanations of a given total allele
frequency combination.
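
The set of possible germline allele frequencies follows directly from the ploidy: a variant can sit on 0, 1, ..., up to all chromosome copies. A minimal Python sketch (the function name is an illustrative assumption, and a ploidy of zero is treated here as allowing only frequency 0):

```python
from fractions import Fraction


def germline_allele_frequencies(ploidy: int):
    """All k / ploidy for k = 0..ploidy, i.e. ploidy + 1 possible frequencies."""
    if ploidy < 0:
        raise ValueError("ploidy must be non-negative")
    if ploidy == 0:
        return [Fraction(0)]
    return [Fraction(k, ploidy) for k in range(ploidy + 1)]


diploid = [float(f) for f in germline_allele_frequencies(2)]
haploid = [float(f) for f in germline_allele_frequencies(1)]
```

Exact fractions avoid rounding issues when such frequencies are later compared for equality.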


For a combination of germline and somatic allele frequencies we can then distinguish 

between the following cases: 

1. All samples that are not direct descendants of other samples (have no incoming
edges in $I_c$ and $I_m$ in the graph $G$) are considered a population and the prior
probability of their combination of germline allele frequencies is calculated, as defined
by DePristo et al. [171], based on a so-called heterozygosity (i.e., the expected
proportion of heterozygous sites in the genome), which is usually known for the
investigated species.

2. For any sample $s \in S$ that inherits clonally from another sample $s' \in S$, we calculate
the prior probability for the somatic allele frequency $f = |\theta_s - \iota_{s'}|$ according to the
method of Williams, Werner, Barnes, Graham, and Sottoriva [730], who report a
formula for the expected cumulative number of somatic mutations per frequency.
The latter can be translated into the corresponding density by normalizing with the
genome size $g$ and taking the first derivative, resulting in $h(f) = \frac{\omega_s}{g f^2}$ for $f > 0$. In
order to also be able to calculate the probability for $f = 0$, we define a reasonably
small $\epsilon$ and define $h(0) = 1 - \int_{\epsilon}^{1} h(f)\,df$.

3. For any sample $s \in S$ that inherits in a Mendelian [443] way from two parents
$s' \in S$ and $s'' \in S$, we first calculate the number of expected constitutive alternative
alleles in the child and the parents by multiplying the ploidy with the respective
germline allele frequency, i.e., $p_s \cdot \iota_s$. We then sum over the probabilities of all
cases of inheriting chromosomes with or without the variant allele from the parent
samples that could explain the expected constitutive alternative alleles. The
individual probabilities can be calculated by modeling an urn drawing process
without replacement, yielding a hypergeometric distribution. Finally, additional
somatic variation, i.e., cases where $\theta_s - \iota_s \neq 0$, is handled by multiplying with the
corresponding prior probability for the somatic allele frequency.

4. Finally, sometimes it might not be possible to formulate prior assumptions about
allele frequencies of a sample $s \in S$. In such cases we specify an allele frequency
universe $U_s \subseteq [0, 1]$ for a sample and assume a uniform distribution.

By taking the product over the priors for individual samples or groups of samples derived from
distinguishing the above cases, the prior probability for any combination of
germline and somatic allele frequencies can be obtained.
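
The urn model behind the Mendelian case can be sketched in a few lines of Python. The sketch assumes that each parent passes on half of its chromosome copies, drawn without replacement, and it ignores de novo germline mutations and somatic variation; the function names are illustrative and not taken from Varlociraptor.

```python
from math import comb


def gamete_prob(parent_alt: int, parent_ploidy: int, passed: int, passed_alt: int) -> float:
    """Hypergeometric probability that `passed_alt` of the `passed` chromosomes
    drawn without replacement from the parent carry the variant allele."""
    parent_ref = parent_ploidy - parent_alt
    passed_ref = passed - passed_alt
    if passed_alt < 0 or passed_ref < 0 or passed_alt > parent_alt or passed_ref > parent_ref:
        return 0.0
    return comb(parent_alt, passed_alt) * comb(parent_ref, passed_ref) / comb(parent_ploidy, passed)


def child_alt_prob(child_alt: int, mother_alt: int, father_alt: int, ploidy: int = 2) -> float:
    """Probability that the child carries `child_alt` variant alleles, summed over
    all splits of these alleles between the maternal and paternal contribution."""
    half = ploidy // 2
    total = 0.0
    for from_mother in range(half + 1):
        from_father = child_alt - from_mother
        if 0 <= from_father <= half:
            total += (gamete_prob(mother_alt, ploidy, half, from_mother)
                      * gamete_prob(father_alt, ploidy, half, from_father))
    return total


# Two heterozygous diploid parents yield the familiar 1:2:1 ratio.
het_cross = [child_alt_prob(k, mother_alt=1, father_alt=1) for k in range(3)]
```

Dividing the allele count of the child by its ploidy gives the germline allele frequency to which the resulting probability is assigned.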


2.3.3.4 Variant Calling Grammar 

The above model is implemented in the software Varlociraptor
(https://varlociraptor.github.io). Varlociraptor offers a variant calling grammar that allows
the user to define a scenario that configures all aspects of the model (prior parameters,
sample graph) via a textual representation in YAML format (YAML Ain’t Markup Language;
https://yaml.org/). A scenario consists of the following sections.


Species In this section, general prior knowledge about the investigated species is 
defined, such as the heterozygosity (see Section 2.3.3.3) and the ploidy (number of 
chromosome copies in a cell). The latter may be defined with sex-specific exceptions 
(such as the X and Y chromosome distribution in humans). 


Samples In this section, the samples and their dependencies (i.e., the sample graph) 
are defined. For each sample, it is necessary to either define an allele frequency universe 
(leading to a uniform prior across the defined frequencies) or the sex. In the latter case, 
ploidy and heterozygosity are taken from the species definition and used to configure 
the prior accordingly. Each sample may be annotated with a contamination by another 
sample in a given fraction (this can be used to define the common case of having a 
tumor sample that also contains healthy normal tissue). Finally, each sample may 
define a type of inheritance (Mendelian or clonal), while referring to the corresponding 
parental samples. 


Events The heart of a scenario is formed by the definition of mutational events of 
interest. These can be used to define any kind of Boolean logic expressions over allele 
frequencies (discrete or intervals) in the given samples. 
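
To illustrate what such an event expression means, the following Python sketch checks whether a given assignment of allele frequencies satisfies a formula written in the notation of Figure 2.18, where `]a,b[` denotes an open and `[a,b]` a closed interval. This is a hypothetical mini-evaluator that supports only conjunctions with `&`; it is not the parser used by Varlociraptor, whose grammar is richer.

```python
import re

_INTERVAL = re.compile(r"([\[\]])([0-9.]+),([0-9.]+)([\[\]])")


def matches(spec: str, freq: float) -> bool:
    """Check one frequency against a value, a set {a,b}, or an interval."""
    spec = spec.strip()
    m = _INTERVAL.fullmatch(spec)
    if m:
        left, low, high, right = m.group(1), float(m.group(2)), float(m.group(3)), m.group(4)
        above = freq >= low if left == "[" else freq > low    # "]" on the left: open bound
        below = freq <= high if right == "]" else freq < high  # "[" on the right: open bound
        return above and below
    if spec.startswith("{") and spec.endswith("}"):
        return any(freq == float(v) for v in spec[1:-1].split(","))
    return freq == float(spec)


def event_holds(formula: str, freqs: dict) -> bool:
    """Evaluate a conjunction such as 'normal:0.0 & tumor:]0.0,1.0]'."""
    for term in formula.split("&"):
        sample, spec = term.strip().split(":", 1)
        if not matches(spec, freqs[sample.strip()]):
            return False
    return True


freqs = {"normal": 0.0, "tumor": 0.3, "relapse": 0.6}
is_somatic_tumor = event_holds("normal:0.0 & tumor:]0.0,1.0]", freqs)
is_germline = event_holds("normal:{0.5,1.0}", freqs)
```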

An example of a scenario modeling the calling of variants in a patient for whom
a normal healthy blood sample, a tumor sample, and a relapse sample are available can
be seen in Figure 2.17. Here, for simplicity, we have initially not defined any prior
knowledge regarding mutation rates etc., thereby modeling a uniform distribution 
over the defined allele frequency universes. An equivalent scenario including this kind 
of prior knowledge is shown in Figure 2.18. Here, it can be seen that we are able to 
define inheritance between the normal and the tumor sample. For the relapse sample, 
although in principle it should inherit mutations from the tumor sample, it is unknown 
to what extent this happens, because usually only one or a few subclones survive the 
therapy. Hence, we refrain from specifying an inheritance between the tumor and the 
relapse, and instead impose a uniform prior on the possible allele frequencies in the 
relapse sample.

Fig. 2.17: Example of a Varlociraptor scenario specification to distinguish between germline variants
and those occurring as somatic events in the primary or relapse sample. (a) Scenario definition
via Varlociraptor's variant calling grammar. The first section defines the three involved samples
(normal healthy blood, primary tumor, and relapse after therapy), along with their contaminations and
expected allele frequency universe. The second section defines the events of interest via Boolean
logic formulas. (b) The resulting structure of the latent variable model, automatically derived from
the scenario definition. (c) Visualization of the expected allele frequencies in the three samples for
each defined event.


2.3.4 Application and Results 


It was previously shown that Varlociraptor is able to significantly improve the recall, 
while precisely controlling the false discovery rate without the need to tune any tech- 
nical filter parameters in the absence of a biological interpretation [342]. Here, we 
illustrate the application of the model by re-analyzing the aforementioned previously
published neuroblastoma dataset [583]. In that study, we analyzed genomic data
from 17 neuroblastomas, for which DNA was available from the primary tumor and
the tumor at relapse. Obtaining the sequence of the entire coding region of the hu- 
man genome (usually referred to as the “exome”) was especially useful for modeling 
intra-tumor heterogeneity and clonal tumor evolution. 

We use the normal-tumor-relapse model formulation from Figure 2.18 and
parametrize it as follows. The effective somatic mutation rate in the tumor sample is set
to $2.93 \cdot 10^{-6}$. This roughly models the expectation of at most 100 de-novo somatic
mutations in typical neuroblastoma tumors found in our original study [583].

Since somatic mutation can also appear in the normal tissue, we set the corresponding
effective somatic mutation rate to $2.8 \cdot 10^{-7}$, as reported by Oota [485]. Finally, the

species:
  heterozygosity: 0.001
  genome-size: 3.1e9
  ploidy:
    female:
      all: 2
      X: 2
      Y: 0

samples:
  normal:
    sex: female
    somatic-effective-mutation-rate: 2.8e-7
  tumor:
    sex: female
    somatic-effective-mutation-rate: 2.93e-6
    inheritance:
      clonal:
        from: normal
    contamination:
      by: normal
      fraction: 0.1
  relapse:
    resolution: 100
    universe: "[0.0,1.0]"
    contamination:
      by: normal
      fraction: 0.53

events:
  germline: "normal:{0.5,1.0}"
  somatic_normal: "normal:]0.0,0.5["
  somatic_tumor: "normal:0.0 & tumor:]0.0,1.0]"
  somatic_relapse: "normal:0.0 & tumor:0.0 & relapse:]0.0,1.0]"

Fig. 2.18: Extension of the Varlociraptor scenario specification in Figure 2.17 to include prior knowl- 
edge. We define the species (here Homo sapiens) in terms of genome size, heterozygosity (expected 
fraction of heterozygous loci), and sex-specific ploidy (number of chromosome copies). In addition, 
we model known somatic mutation rates, and define that the tumor inherits germline mutations from 
the normal sample. 


tumor and the relapse sample tissues are usually contaminated by healthy cells. We use
the amounts of contamination reported in the original study [583].


Workflow Analyzing sequencing data for genomic variants entails a variety of steps, 
which we outline in Figure 2.19. The entire analysis is implemented as a Snakemake 
workflow [343]. 

First, raw reads are processed by (a) trimming so-called sequencing adapters, (b) 
mapping them to the reference genome of the corresponding species, (c) removing 
putative duplicates from the Polymerase Chain Reaction (PCR), and (d) recalibrating 
base qualities. Sequencing adapters (a) are non-biological artifacts of the sequencing 
process. Since they are known beforehand, they can be removed from the reads by

Fig. 2.19: Schematic representation of applied genomic variant calling workflow. Nodes represent
original or derived data (gray labels in the left column), arrows represent processing steps (black
labels in the left column).


performing an error-tolerant alignment between each read and the known sequence. 
We use Cutadapt to perform this step [431]. By mapping reads to the reference genome 
(b), we obtain the correct order and individual differences of each read compared with 
the representative genome of the underlying species. The resulting read alignments 
already contain all necessary observations for applying the Varlociraptor model. In 
order to obtain a signal of sufficient strength, sequencing protocols often entail the 
amplification of the DNA material via polymerase chain reaction [24]. The result is that 
there can be multiple reads from the same DNA fragment. Since Varlociraptor assumes
each read to be an independent observation, it is important to remove such putative
PCR duplicates (c), which we achieve using Picard tools [500]. Finally, the sequencing
process sometimes causes artifacts to appear next to certain motifs [19]. In (d), we 
therefore use the base recalibration process from the Genome Analysis ToolKit (GATK 
[171]), which systematically investigates base alteration causing motifs and recalibrates 
the per base confidence scores in each sequencing read to reflect the uncertainty about 
whether an altered base is a true signal or a motif-induced artifact. 

Second, the aligned reads are used to generate candidate variants. We use the tools
Freebayes [222] and Delly [523] for this purpose. While the former covers small variants
that can be covered by a single read (SNVs, MNVs, small insertions and deletions), the
latter covers large, structural variants (large insertions and deletions, inversions, and
duplications). Importantly, while both Freebayes and Delly provide their own statistical
models for calling variants, we utilize them only to generate candidate hypotheses. Both

Fig. 2.20: Mutational burden of patient 1 from Schramm et al. [583] in the primary tumor (left) and
relapse sample (right). The horizontal axis shows the minimum allele frequency, the vertical axis shows
the mutational burden as number of coding somatic mutations (calculated as expected value over
the posterior probability for having a somatic mutation) per megabase of coding genome. The colors
represent different types of mutations (see legend).


models are designed only for specific cases and are not generic enough to handle the 
composition of samples available in this dataset. 

Third, we use Varlociraptor to (a) extract observations for each sample and each 
candidate variant and (b) apply the model as defined in the corresponding scenario for 
each patient in the study data. 

Fourth, we (a) annotate the variant calls from Varlociraptor with their impact on
proteins via the VEP tool [440] and (b) filter them for those that are of interest. In this
case, we strive for three disjoint sets of variants:

1. Variants that have been previously described as pathogenic or likely pathogenic in
other studies.

2. Variants with high impact on the protein but which have not been previously 
described by other studies. 

3. Variants with moderate impact on the protein but which have not been previously 
described by other studies. 


Finally, we separately control the local false discovery rate for somatic variants in either 
the tumor or the relapse sample on each of the three sets. 
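
The text does not spell out how the false discovery rate is controlled. One common way to do this from posterior probabilities, shown here purely as an illustration and not necessarily identical to the procedure in Varlociraptor, is to rank calls by their posterior error probability and keep the longest prefix whose mean error probability stays below the desired threshold. The function name and the example values are hypothetical.

```python
def control_fdr(posteriors, alpha):
    """Return indices of calls kept such that the expected FDR is at most alpha.

    posteriors[i] is the posterior probability that call i is a true event
    (for example, a somatic tumor variant); 1 - posteriors[i] is its
    posterior error probability.
    """
    order = sorted(range(len(posteriors)), key=lambda i: 1.0 - posteriors[i])
    kept, error_sum = [], 0.0
    for rank, i in enumerate(order, start=1):
        error_sum += 1.0 - posteriors[i]
        # The running mean is non-decreasing because errors are sorted ascending.
        if error_sum / rank > alpha:
            break
        kept.append(i)
    return kept


posteriors = [0.99, 0.5, 0.95, 0.999]
kept = control_fdr(posteriors, alpha=0.05)
```

Applying the function separately to each of the three variant sets mirrors the separate control described above.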


Insights In the following, we summarize the most important insights from reanalyzing
the study data with this workflow.

Figure 2.20 shows the mutational burden as a curve over the minimum allele 
frequency on an example patient. It can be seen that the burden for higher frequencies 
in general increases in the relapse sample compared with the tumor sample. This 
supports the hypothesis that the relapse sample originates from a subclone of the tumor
sample, which has survived therapy. Thus, one can expect that resistance-inducing
mutations in the relapse sample become more abundant. Our findings contribute to the 
emerging view of resistance to cancer therapies as an evolutionary process. Selection 
of surviving clones results in mutational fingerprints that are specific for resistant or 
recurrent tumors. A better understanding of these genetic fingerprints is a prerequisite 
for identifying markers allowing early detection of resistance or tumor recurrence and 
enabling timely adjustment of therapies to further improve the survival and cure of 
cancer patients. 

Future work entails the interpretation of individual recurrent deleterious gene and 
pathway alterations across the analyzed samples. Moreover, we aim to further improve 
the prior model of Varlociraptor such that assumptions about subclonal inheritance 
patterns can be incorporated as well. 

Finally, we will combine the statistical approach of Varlociraptor with alignment-free
methods, as outlined in Section 2.3.2. Since Varlociraptor has to perform a realignment
of read sequences anyway (see Section 2.3.3.2), we may replace the initial read
alignment with an alignment-free approach that yields a rough positioning of reads
on the reference genome so that they can be selected for validating a given candidate
variant with Varlociraptor. For this, it is necessary to accurately estimate the alignment
uncertainty from the k-mer hits via, say, the strategy proposed in our previous work on
PEANUT [344]. In addition, the detection of candidate variants with alignment-free methods
has to be extended beyond single nucleotide variants. Here, a possible strategy might
be a hybrid approach where aberrations in k-mer counts are translated into an exact
variant call by (a) collecting the causing reads, (b) assembling them into one or more
consensus sequences [114], and (c) aligning these against the reference genome to
determine the nature of the variant.


Abstract: In this contribution, we will present Bayesian approaches for dimensionality
and complexity reduction in the context of health-related problems. Following an in- 
troduction to Bayesian analysis in general, we will first show two examples of Bayesian 
variable selection methods for reducing the number of variables, one for binary data 
with an application to Single Nucleotide Polymorphisms (SNPs) for the HapMap dataset, 
and one for time-to-event endpoints with an application to glioblastoma data from 
the Cancer Genome Atlas. Second, we will present an approach for reducing statistical 
models, where we transfer the Merge & Reduce principle to maintain statistical sum- 
maries in streaming models. The variable selection approaches as well as the Merge & 
Reduce approach are important steps towards resource-aware data analyses. 


2.4.1 Introduction 


Machine learning obtains predictions by constructing a predictive model. Many ma- 
chine learning algorithms are built on black-box models, as opposed to statistical 
learning, which is more concerned with the statistical properties of the model, such 
as the distribution of the variables and its parameters. In this section, we focus on 
the problem of data- and dimensionality-reduction in Bayesian statistics. Bayesian 
regression does not assume a fixed optimal solution for a dataset as in the frequentist 
case, but introduces a distribution over the parameter space. The likelihood function 
models the information that comes from the data, and the prior distribution models 
problem-specific prior knowledge. Our goal is to explore and characterize the posterior 
distribution, which, as a consequence of Bayes’ theorem, is a compromise between the 
observed data situation and the prior knowledge that we assume for the parameters. For 
very large and high-dimensional datasets and settings where computational resources 
are scarce, the posterior distribution is hard to obtain. In general, our work focuses on 
algorithmic approaches that can be implemented in streaming and distributed environ- 
ments to reduce the underlying problem in order to enhance the scalability of modern 
Bayesian regression approaches. In this contribution, we particularly concentrate on 
biomedical applications. Depending on the large-scale high-dimensional problems at 
hand, our interest centers around a) reducing the number of observations, b) reducing
the number of variables, or c) reducing the underlying statistical model. Task a)
comprises approaches such as sketching or coreset methods. We have made various
contributions on this topic; see, e.g., Geppert et al. [226] or Munteanu et al. [463]. For 
more details on this topic, see Section 3.2 in Volume 1 (“Coresets and Sketches for 
Regression Problems on Data Streams and Distributed Data”). Task b) is particularly 
relevant for high-dimensional settings in which the number of variables exceeds the 
number of observations. This is the main problem in many genetic applications. We 
will present two variable selection approaches for reducing the number of variables, 
one in the context of a single data source for Bayesian (logic) regression, one in the 
context of integrating multiple genomic data sources in a Bayesian Cox model. For task 
c) of reducing statistical models, we transfer the Merge & Reduce principle to maintain 
statistical summaries in streaming models (Geppert et al. [227]). We can compute the 
necessary results for a regression model by analyzing them blockwise and combining 
the summaries of each block in a structured way; more details are given in Section 2.4.3. 


2.4.2 Variable Selection 


Variable selection is one of the central problems in modern statistics. In some
fields, researchers are often faced with the problem that the number of variables is
larger than the sample size, which is commonly known as the p larger than n problem.
In this case, training statistical models directly on the original dataset may lead to many 
problems, such as overfitting or the inability to use the Ordinary Least Squares (OLS) 
regression method. Therefore, selecting only those variables that are truly informative 
becomes a very important step in the process of constructing a model. 

A popular approach for variable selection is to make the model sparse by forcing
the coefficients of some variables to be zero or to converge to zero during the regression
process by $L_1$ or $L_2$ regularization. Another often-used approach is typically seen in
ensemble algorithms: all variables are first included in the model, then the variables
are ranked according to their importance by a variable importance measure, and the
variables that have an impact on the dependent variable are selected after modeling.
The following two approaches are well suited for medical applications: a variable 
importance measure approach employed prior to model building, and a regularized 
modeling approach using suitable priors and a stochastic search for variable selection. 


2.4.2.1 Variable Selection for High-dimensional Binary Data 

For high-dimensional binary data, e.g., genetic marker data, interactions between 
variables are often more important than main effects, which increases the number 
of variables even further. In this section, we describe how to use so-called leverage 
scores and cross-leverage scores as measures of variable importance to select subsets of
explanatory variables while retaining valuable interactions. The leverage and
cross-leverage scores are also compared with the more popular variable selection criteria
correlation coefficients and p-values for univariate linear regressions. In contrast to 
common variable selection criteria, our approach focuses on variable selection prior to 
building the model. For more details on our method, see Parry et al. [492]. 

Obtaining the leverage and the cross-leverage scores requires the calculation of the 
hat matrix of the data matrix, which is a projection matrix that carries the information 
about the impact of variables instead of observations or responses. 

For a data design matrix $X$ and response vector $y$, we set

$$\tilde{X} = [X, y]^T \in \mathbb{R}^{(p+1) \times n} \tag{2.8}$$

and obtain the hat matrix

$$H = \tilde{X}(\tilde{X}^T \tilde{X})^{-1} \tilde{X}^T \in \mathbb{R}^{(p+1) \times (p+1)}. \tag{2.9}$$

Calculating the hat matrix requires the calculation of the inverse matrix $(\tilde{X}^T \tilde{X})^{-1}$, which
is not a stable calculation (see, e.g., Geppert et al. [226]). An alternative method is to
obtain the hat matrix by QR-decomposition.

The $i$-th element on the main diagonal of the hat matrix for the first $p$ elements
$i = 1, \dots, p$ is an importance measure of the $i$-th variable, and is also known as its
leverage score (see, e.g., Drineas et al. [181]). The cross-leverage scores are obtained from
the off-diagonal elements of the hat matrix (Chatterjee and Hadi [122]) and describe the
mutual influence of the $i$-th variable on the $j$-th one for $j = 1, \dots, p$ and on the response
variable for $j = p + 1$.
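
The inverse-free route can be sketched as follows: if the columns of $Q$ form an orthonormal basis of the column space of the transposed data matrix, then the hat matrix equals $QQ^T$, so leverage scores are squared row norms of $Q$ and cross-leverage scores are inner products of its rows. The Python sketch below uses modified Gram-Schmidt in place of a library QR routine, purely to stay dependency-free; the function names and the toy data are illustrative assumptions.

```python
def orthonormal_basis(rows):
    """Orthonormalize the columns of a matrix given as a list of rows."""
    n_rows, n_cols = len(rows), len(rows[0])
    basis = []
    for k in range(n_cols):
        w = [float(rows[i][k]) for i in range(n_rows)]
        for q in basis:
            proj = sum(a * b for a, b in zip(q, w))
            w = [a - proj * b for a, b in zip(w, q)]
        norm = sum(a * a for a in w) ** 0.5
        if norm > 1e-12:  # skip numerically dependent columns
            basis.append([a / norm for a in w])
    return basis


def leverage_scores(rows):
    """Diagonal of the hat matrix: one score per row (variable or response)."""
    basis = orthonormal_basis(rows)
    return [sum(q[i] ** 2 for q in basis) for i in range(len(rows))]


def cross_leverage(rows, i, j):
    """Off-diagonal hat matrix element H[i][j]."""
    basis = orthonormal_basis(rows)
    return sum(q[i] * q[j] for q in basis)


# Toy example: three binary variables plus the response (last row), n = 2 observations.
x_tilde = [
    [1, 1],
    [1, 0],
    [0, 1],
    [1, 1],  # response y
]
scores = leverage_scores(x_tilde)
```

Since the hat matrix is a projection, the scores sum to the rank of the matrix, which offers a quick sanity check.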

When selecting the most important variables prior to building the model, one 
important consideration is how many variables should be included in the pre-selection. 
Parry et al. [492] point out, based on the available literature, that using $O(n \ln n)$ is a
valid indicator for the optimal number of the pre-selected variables and that choosing
that many variables allows the subsampled data matrix to remain at full rank.

For studying the relationship between binary variables as well as their interactions
with respect to a response of interest, Ruczinski et al. [550] developed the so-called
logic regression, an adaptation of a generalized linear regression model. In logic
regression, the binary variables are not employed directly as independent variables, but are
combined using the logical operators $\wedge$ (and), $\vee$ (or), and the negation $!$. The resulting
Boolean expressions $L_i$ are called logic trees, and the corresponding logic regression
model can be written as:

$$g(E(y)) = \beta_0 + \sum_{i=1}^{T} \beta_i L_i \tag{2.10}$$

where $g(\cdot)$ is the link function of the generalized linear model and $L_i$ for $i \in \{1, \dots, T\}$
are the Boolean combinations of the binary variables. In our genetic application, we
employed the logit link function. The models are fitted with an iterative search algorithm
based on Simulated Annealing (see, e.g., [550]).

While logic regression models are well suited for analyzing the association of binary
variables and their interactions with a response, a drawback is that they can handle 
only a limited number of variables. Here, our approach using leverage or cross-leverage 
scores is helpful, since it reduces the number of independent variables prior to the 
analysis, while keeping the most influential ones. 
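
To make the structure of the logic regression model concrete, the following sketch evaluates the model for given logic trees and coefficients under the logit link. It does not perform the Simulated Annealing search; the trees, coefficients, and the function name `logic_regression_probability` are hypothetical and chosen only for illustration.

```python
import numpy as np

def logic_regression_probability(X, logic_trees, beta0, betas):
    """P(y = 1 | X) under a logic regression model with logit link.

    X: boolean matrix of shape (n, p).
    logic_trees: list of callables, each mapping X to a boolean vector of
    length n (the value of L_i for every observation).
    """
    L = np.column_stack([tree(X) for tree in logic_trees]).astype(float)
    eta = beta0 + L @ np.asarray(betas)     # beta_0 + sum_i beta_i * L_i
    return 1.0 / (1.0 + np.exp(-eta))       # inverse of the logit link

# Hypothetical logic trees: L_1 = X_1 AND (NOT X_2), L_2 = X_3 OR X_4
trees = [
    lambda X: X[:, 0] & ~X[:, 1],
    lambda X: X[:, 2] | X[:, 3],
]
X = np.array([[1, 0, 0, 0],
              [1, 1, 0, 1],
              [0, 0, 0, 0]], dtype=bool)
probs = logic_regression_probability(X, trees, beta0=-1.0, betas=[2.0, 0.5])
```

Each tree contributes a single binary column to the linear predictor, which is why the number of candidate variables entering the search should be small.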

We evaluate our approach using simulated data from different scenarios as well 
as the real HapMap dataset. We compare the leverage and cross-leverage scores for 
variable selection with correlation coefficients and p-values as selection criteria. In our 
simulation study, we have created different data scenarios in order to explore the ability 
of these variable selection measures to select the correct variable when the number 
of independent variables and the sample sizes vary. We also study these measures for 
main effects and for the presence of different higher-order interactions. 

The genetic HapMap data was collected as part of an international collaboration to 
develop a haplotype map of the human genome. The subset considered in our work is 
available in the R-package SNPassoc [234]. For our analysis, we considered 7648 human 
single-nucleotide polymorphisms (SNPs) employed as binary variables for n = 120 
individuals from two separate ethnic groups. The dependent variable was set to 1 when 
an individual is from central Europe (CEU), and 0 when it belongs to the Yoruba group 
(YRI) that inhabits western Africa. 

For the simulated data, we show the results by plotting the distributions of lever- 
age scores and cross-leverage scores for the different scenarios (see Parry et al. [492]). 
The leverage scores perform better in selecting important variables when only main 
effects are present, whereas cross-leverage scores distinguish variables of higher-order 
interactions better. A change in the number of variables p did not have much effect on 
the overall performance of the two measures when the number of irrelevant variables 
increased. By contrast, when the sample size increases, both scores for the informative 
variables increase while both measures for the irrelevant variables decrease, meaning 
that more samples make it easier to find the truly informative variables for both mea- 
sures. Moreover, we compare the leverage and cross-leverage scores to the correlation 
coefficients and to p-values. 

We also propose and investigate a way of combining cross-leverage scores and 
leverage scores by selecting the variables using these two metrics separately and taking 
the union of all selected variables. We show that this combined approach is superior to 
using just one metric alone (see Parry et al. [492]). 
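
The pre-selection size and the combined selection rule can be sketched as below. Ranking cross-leverage scores by absolute value is an assumption of this sketch, and the helper names are hypothetical.

```python
import math
import numpy as np

def n_preselected(n):
    """Number of pre-selected variables, ceil(n * ln n)."""
    return math.ceil(n * math.log(n))

def top_k(scores, k):
    """Indices of the k scores that are largest in absolute value."""
    order = np.argsort(-np.abs(scores), kind="stable")
    return set(order[:k].tolist())

def combined_selection(leverage, cross_leverage, k):
    """Union of the variables selected separately by the two scores."""
    return top_k(leverage, k) | top_k(cross_leverage, k)
```

Because the union is taken, the combined set contains between $k$ and $2k$ variables, depending on how strongly the two rankings overlap.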

For the HapMap data, we present a raster plot (see Figure 2.21). It is a graphical
representation of a matrix containing SNP data, where the rows correspond to individuals
and the columns correspond to SNPs. To make the criteria comparable, we take subsets of size
$\lceil 120 \ln(120) \rceil = 575$ of the most important SNPs using cross-leverage scores (CLS),
leverage scores (LS), correlations (COR), and p-values, respectively. We can easily distinguish
the two groups using cross-leverage scores; thus, the selected SNPs can be used
to classify individuals of the two groups. Using leverage scores, it is almost impossible
to distinguish the two groups. The variables selected using correlation coefficients or
p-values show hints of blocks corresponding to the two groups, but much less clearly than
those selected by the cross-leverage scores.

Fig. 2.21: Raster plots of all 120 subjects from two continental regions with respect to a subset
of 575 out of 7648 SNPs (green/light pink denotes homozygotic individuals and white denotes
heterozygotic individuals), selected using CLS, LS, COR, and p-values, respectively.

Based on our simulation study and the real data example, cross-leverage scores 
turn out to be a promising tool for variable selection prior to model building, especially 
in the presence of higher-order interactions, and leverage scores prove useful for select- 
ing main effects. A combination of leverage and cross-leverage scores usually further 
improves variable selection. 


2.4.2.2 Variable Selection for High-Dimensional Survival Data 

In the following, we briefly summarize a Bayesian approach that combines the recent 
progress in the following areas in one model: 

1. analyzing time-to-event data in high dimensions 

2. variable selection in a high-dimensional setting, and 

3. integration of several data sources. 


Here, we give a brief overview. For more details on this research, see Treppmann et al. 
[691]. 

We usually encounter time-to-event endpoints or survival data in cancer studies. 
To analyze them, Cox [149] created the semi-parametric proportional hazards regres- 
sion model that considers the relation between covariates and the hazard function. 
Therefore, the Cox model has often been applied to low-dimensional data. However, 
in biological applications with genomic data, we often deal with high-dimensional 
settings with more variables than subjects. This shows the need for a high-dimensional 
survival time model. With this in mind, Lee et al. [382] developed a Bayesian version of 
the Cox model for right-censored survival data, where high dimensions are treated by a
regularization of the regression coefficient vector via Laplace priors. Another contribution
to survival prediction for high-dimensional data based on the Cox model in this
volume is presented by Rahnenführer et al. in Section 2.5.

A second aspect resulting from a huge number of variables is the need for variable 
selection. In a high-dimensional setting, common methods like best subset selection 
as well as backward and forward selection prove to be unsuitable for various reasons. 
Bayesian techniques provide a good alternative to search stochastically over the entire 
parameter space, especially as they implicitly address model uncertainty. One example 
is the stochastic search variable selection (SSVS) by George and McCulloch [224], an 
approach commonly used in regression analyses. It is a flexible and intuitive procedure 
that uses data augmentation for the selection task and includes shrinkage. 

Moreover, the Bayesian setting provides a way for incorporating additional data 
sources. The interest in such integrative statistical analyses is growing steadily, as 
technological progress makes it possible to collect different genome-wide data system- 
atically. The integration of more than one information source can lead to an improve- 
ment in the performance of risk prediction models and, therefore, to a more detailed 
understanding of the biology of diseases. For a recent overview of integrative Bayesian 
analyses in molecular biology, see Ickstadt et al. [295]. 

In conclusion, our approach combines the variable selection procedure of George 
and McCulloch [224] with the Cox proportional hazards model of Lee et al. [382] in one 
Bayesian model and integrates a further data source by means of an informed prior. 

As mentioned, for right-censored survival data in high dimensions with the number
of variables $p$ being (much) larger than the number of subjects $n$, Lee et al. [382] developed
a Bayesian variant of the semiparametric proportional hazards model
$h(t \mid x) = h_0(t) \cdot \exp(x^\top \beta)$ by Cox [149]. Here, $h_0(t)$ denotes the underlying baseline hazard
function, $t$ the survival time of a person with covariable vector $x = (x_1, \ldots, x_p)^\top$, and
$\beta = (\beta_1, \ldots, \beta_p)^\top$ the vector of regression coefficients. By a finite partitioning of the time
axis, $0 < s_0 < s_1 < s_2 < \ldots < s_J$ with $s_J > t_r$, $\forall r = 1, \ldots, n$, such that the breaks are
points where at least one event occurs, and the last event lies inside the last interval,
Lee et al. [382] obtain the following grouped likelihood introduced by Burridge [110]:

$$
\begin{aligned}
L(D \mid \beta, h) &\propto \prod_{j=1}^{J} \Biggl[ \exp\Bigl(-h_j \sum_{l \in (R_j - D_j)} \exp(x_l^\top \beta)\Bigr) \cdot \prod_{k \in D_j} \Bigl(1 - \exp\bigl(-h_j \exp(x_k^\top \beta)\bigr)\Bigr) \Biggr], \\
h_j &\sim \mathrm{Gamma}(\alpha_{0j} - \alpha_{0,j-1}, c_0),
\end{aligned}
\tag{2.11}
$$

with $\alpha_{0j} = c_0 \cdot H^*(s_j)$ and $c_0 > 0$, $j = 1, \ldots, J$.


In this context, $D = \{(x, R_j, D_j) : j = 1, \ldots, J\}$ denotes the observed data, with $R_j$ being
the risk set and $D_j$ being the event set regarding the $j$th interval. In case of choosing a
Weibull distribution for the monotonically increasing function $H^*(t)$ (with $H^*(0) = 0$),
we obtain $H^*(t) = \eta_0 \cdot t^{\kappa_0}$ with hyperparameters $(\eta_0, \kappa_0)$.
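
As an illustration of how the grouped likelihood is evaluated, the following sketch computes its logarithm for given $\beta$ and $h$. The interval convention (interval $j$ is $(s_{j-1}, s_j]$, the risk set contains all subjects whose observed time exceeds $s_{j-1}$) and the function name are assumptions of this sketch, not taken from Lee et al. [382].

```python
import numpy as np

def grouped_log_likelihood(X, time, event, breaks, beta, h):
    """Log of the grouped likelihood, up to an additive constant.

    X: (n, p) covariates; time: (n,) observed times; event: (n,) array with
    1 = event and 0 = censored; breaks: (J + 1,) increasing interval borders;
    h: (J,) increments of the cumulative baseline hazard.
    Assumed convention: interval j is (s_{j-1}, s_j]; R_j holds the subjects
    with time > s_{j-1}; D_j holds those with an event inside interval j.
    """
    risk_score = np.exp(X @ beta)                  # exp(x^T beta) per subject
    loglik = 0.0
    for j in range(len(h)):
        lo, hi = breaks[j], breaks[j + 1]
        at_risk = time > lo                                  # R_j
        failed = at_risk & (time <= hi) & (event == 1)       # D_j
        survived = at_risk & ~failed                         # R_j - D_j
        loglik -= h[j] * risk_score[survived].sum()
        # log(1 - exp(-a)), written with expm1 for numerical stability
        loglik += np.log(-np.expm1(-h[j] * risk_score[failed])).sum()
    return loglik
```

In an MCMC sampler, a function of this kind would be called once per proposal, so its cost per evaluation largely determines the overall running time.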

For the variable selection, we apply the SSVS procedure of George and McCulloch 
[224]. Under the assumption that the variances of the regression coefficients of variables 
included in the model are equal, the prior distribution for $\beta_i$ conditioned on $\gamma_i$ has the
following form:

$$\beta_i \mid \gamma_i \sim (1 - \gamma_i) \cdot N(0, \tau^2) + \gamma_i \cdot N(0, c^2 \tau^2), \quad i = 1, \ldots, p, \tag{2.12}$$

with small $\tau^2 > 0$ and $c^2 > 1$. Following the concept of data augmentation (Tanner and
Wong, 1987 [673]), the indicator vector $\gamma$ states whether the associated variables are
included in the model or not.

Inference is based on Markov Chain Monte Carlo (MCMC) algorithms. For updating
the full conditional distribution $P(\beta_i \mid \beta_{-i}, \gamma, h, D)$ with

$$\beta_{-i} = (\beta_1, \ldots, \beta_{i-1}, \beta_{i+1}, \ldots, \beta_p)^\top \tag{2.13}$$

we use the special random walk Metropolis-Hastings method with adaptive jumping
rules proposed by Lee et al. [382]. Moreover, the conditional distributions
$P(\gamma_i = 1 \mid \beta, \gamma_{-i}, h, D)$ with $\gamma_{-i} = (\gamma_1, \ldots, \gamma_{i-1}, \gamma_{i+1}, \ldots, \gamma_p)^\top$ are derived by means of the
Bernoulli distribution. The full conditional distribution $P(h_j \mid h_{-j}, \beta, \gamma, D)$ with
$h_{-j} = (h_1, \ldots, h_{j-1}, h_{j+1}, \ldots, h_J)^\top$ is approximable by a Gamma distribution. To update $\beta$, $\gamma$ and
$h$ iteratively according to the full conditional distributions described above, a Gibbs
sampler is appropriate.
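
The Bernoulli update of the indicators can be sketched as follows. The sketch assumes independent Bernoulli priors on the indicators with inclusion probabilities `prior_incl`; under the mixture prior (2.12), the inclusion probability of variable $i$ then depends only on $\beta_i$. The function names are hypothetical.

```python
import numpy as np

def inclusion_probability(beta, prior_incl, tau2, c2):
    """P(gamma_i = 1 | beta_i) under the two-component normal mixture prior.

    prior_incl: prior probabilities P(gamma_i = 1); independent Bernoulli
    priors on the indicators are an assumption of this sketch.
    """
    def normal_pdf(x, var):
        return np.exp(-0.5 * x**2 / var) / np.sqrt(2.0 * np.pi * var)

    slab = prior_incl * normal_pdf(beta, c2 * tau2)        # component for gamma_i = 1
    spike = (1.0 - prior_incl) * normal_pdf(beta, tau2)    # component for gamma_i = 0
    return slab / (slab + spike)

def update_gamma(beta, prior_incl, tau2, c2, rng):
    """One Gibbs update of the indicator vector gamma by Bernoulli draws."""
    prob = inclusion_probability(beta, prior_incl, tau2, c2)
    return (rng.random(beta.shape) < prob).astype(int)
```

For coefficients far from zero, both densities can underflow in floating point; a production implementation would typically work on the log scale.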

In addition to a simulation study, which will not be discussed here, we applied 
our method to a dataset of glioblastoma multiforme (GBM) patients, retrieved from 
the Cancer Genome Atlas [471] database. In adults, glioblastoma is the most frequent 
and the most rapidly growing brain tumor. The used dataset comprises 210 patients 
and includes survival and gene expression data as well as associated copy number 
variation (CNV) data, which are used to construct an informative prior. We restrict 
the analysis to the 1000 genes that show the greatest variance in their values. The 
underlying assumption is that genes with low variability are probably not well suited to 
distinguish between patients with a good and patients with a poor survival prognosis. 
We divide the dataset into a training dataset of 140 patients for model fitting and a
test dataset of 70 patients for evaluation.

For the analysis, we assume a prior expected number of selected variables of $k = 20$.
We construct an informative prior such that the prior inclusion probability $\pi_i^{\mathrm{CNV}}$ of the
$i$th variable is proportional to the standard deviation $\sigma_i^{\mathrm{CNV}}$ of the copy number variation
data for the associated genomic region across patients. Thus $\pi_i^{\mathrm{CNV}}$ is defined as

$$\pi_i^{\mathrm{CNV}} = k \cdot \frac{\sigma_i^{\mathrm{CNV}}}{\sum_{j=1}^{p} \sigma_j^{\mathrm{CNV}}}, \quad i = 1, \ldots, p. \tag{2.14}$$

For comparison purposes, we use the non-informative prior $\pi = (k/p, \ldots, k/p)^\top$.
In the course of the evaluation, we consider the posterior means and standard deviations
of the parameters $\beta$ and $\gamma$. To decide which variables to select, we first determine the
mean model size $p_m$ by rounding the average number of selected variables per iteration. Then
we select the $p_m$ variables with the highest selection probability.
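
A small sketch of the two priors may help to see how they differ. The scaling by $k$, which makes the inclusion probabilities sum to the prior expected model size, and the use of the sample standard deviation are assumptions of this sketch; the function names are hypothetical.

```python
import numpy as np

def cnv_prior(cnv, k):
    """Prior inclusion probabilities proportional to the CNV standard deviation.

    cnv: array of shape (patients, p) with copy number values per gene.
    Assumption: the probabilities are scaled to sum to k. The caller should
    check that no entry exceeds 1.
    """
    sd = cnv.std(axis=0, ddof=1)
    return k * sd / sd.sum()

def uniform_prior(p, k):
    """Non-informative prior: every variable gets inclusion probability k / p."""
    return np.full(p, k / p)
```

With $p = 1000$ and $k = 20$, the non-informative prior assigns each gene an inclusion probability of 0.02, while the informative prior shifts mass toward genes whose copy number varies strongly across patients.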

We conduct a combined analysis of five Markov chains, each with a length of 
100 000 of which the first 10 % are removed as burn-in. 
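
The pooling of the chains and the selection rule described above can be sketched as follows, assuming the sampled indicator vectors of each chain are stored as a 0/1 array with one row per iteration. The function names are hypothetical.

```python
import numpy as np

def pool_chains(chains, burn_in_fraction=0.1):
    """Drop the burn-in from each chain and stack the remaining draws.

    chains: list of arrays of shape (iterations, p) holding sampled gamma vectors.
    """
    kept = [c[int(len(c) * burn_in_fraction):] for c in chains]
    return np.concatenate(kept, axis=0)

def select_variables(gamma_draws):
    """Mean model size p_m and the p_m variables with the highest selection probability."""
    # Note: Python's round() rounds exact halves to the nearest even integer.
    p_m = int(round(gamma_draws.sum(axis=1).mean()))
    selection_prob = gamma_draws.mean(axis=0)
    selected = np.argsort(-selection_prob, kind="stable")[:p_m]
    return p_m, selected, selection_prob
```

For five chains of length 100 000 with 10 % burn-in, `pool_chains` would return 450 000 draws per variable.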

The simulation result from Table 1 in Section 2 in Treppmann et al. [691] shows that 
the posterior selection probabilities differ greatly depending on whether the informative 
or uninformative prior was used. Only three genes are among the $p_m = 10$ (uninformative
prior) or $p_m = 9$ (informative prior) variables with the highest posterior selection
probability in both cases.

To evaluate the goodness of the prediction, we consider prediction error curves 
and determine the integrated Brier score [241, 589] in comparison with the Kaplan- 
Meier estimator without any covariates (reference approach). This shows that in the 
case of the informative prior, our model improves the prediction performance relative 
to the reference approach, while this is not observed for the uninformative prior. The 
examination of trace plots of the simulated MCMC chains indicates that the chains move
quickly into desired regions of the model space and exhibit good mixing performance. 

Both in our application to glioblastoma data and in our simulation study, we have 
shown that the inclusion of a second data source has distinct potential for improvement 
in terms of prediction quality. However, this is only the case if the second data source 
provides an informative prior. This requires that variables with an increased prior 
selection probability tend to be truly associated with the response. However, since this 
is usually not known in practice, a comparison with the model using an uninformative 
prior is always appropriate. 

Due to the Bayesian modeling, we obtain full inference, especially concerning 
the posterior selection probabilities. The joint analysis of all variables offers the great 
advantage that posterior selection probabilities of whole sets of variables can be con- 
sidered. One example would be a group of genes that has been shown to be particularly 
influential in previous studies. 

Since in MCMC approaches there is usually a trade-off between the computational 
cost and the accuracy of the results, efficient programming is of particular importance. 
We recently re-implemented our approach in Python, eliminating some inefficiencies 
of our previous implementation of the algorithm in R. 


2.4.3 Merge & Reduce for Statistical Models 


Datasets with a massive number of observations have become more and more common, 
making scalability a major challenge for modern data analysis. For many statistical 
methods, these amounts of data lead to an enormous consumption of resources. A 
prominent example is linear regression, an important statistical tool in both Bayesian
and frequentist settings. On very large datasets, regression analysis becomes increasingly
demanding with regard to running time and memory consumption, making the
analysis tedious or even impossible. We propose a method called Merge & Reduce to ad- 
dress these scalability limitations in regression analyses. Merge & Reduce is well known 
in computer science and has mainly been used for transforming static data structures 
to dynamic ones with little overhead (Bentley and Saxe [56]). Instead of reducing the 
data to approximate the full dataset with respect to some model, we propose using the 
statistical models derived from small batches as concise summaries. Combining these 
statistical models via the Merge & Reduce framework, we can turn an offline algorithm 
into a data stream algorithm. 

Here, we focus on streaming to deal with massive datasets, where a data stream 
algorithm is given an input stream of items, like numerical values, vectors, or edges of a 
graph at a high rate. The algorithm is allowed to make only one single pass over the data. 
As the items arrive one by one, it maintains a summary of the data that was observed 
so far in the form of, say, a subsample or a summary statistic. Despite our focus on a
streaming setting, we stress that the Merge & Reduce scheme can also be implemented
in distributed environments. 

Our contribution is to develop the first Merge & Reduce scheme that works directly 
on statistical models. We show how to design and implement this general scheme for the 
special cases of (Bayesian) linear models, Gaussian mixture models, and generalized 
linear regression models in Geppert et al. [227]. Here, we will restrict ourselves to 
Bayesian linear models and evaluate the resulting streaming algorithm on simulated 
datasets. We demonstrate that we obtain stable regression models from large data 
streams and that the Merge & Reduce schemes produce little overhead. 


2.4.3.1 Method 

In our Merge & Reduce method for statistical data analysis, we iteratively load as many 
observations into the memory as we can afford. On each of these blocks, we apply a 
classical algorithm to obtain, say, the parameters of a statistical model, some (sufficient) 
statistics or a summary of the presented data; in short, a model. Models are merged 
according to certain rules, eventually resulting in a final model that combines the 
information from all subsets. Merge & Reduce leads to stable results, where every 
observation enters the final model with equal weight, thus ensuring that the order of 
the data blocks does not bias the outcome toward single observations. 

In order to design a streaming algorithm for a specific statistical analysis task, we 
need to choose an appropriate model as a summary statistic for each block of data. The 
two main ingredients that we need to implement for this particular choice of a model 
are called merge and reduce. 

1. Let $M_1$, $M_2$ be the models obtained from the analysis of data blocks $B_1$, $B_2$; then
the output of merge($M_1$, $M_2$) is a model $M$ for the union $B_1 \cup B_2$ of the input data
blocks.

2. Let $M$ be a model for data block $B$ that has become too large, i.e., $|M| \geq 2T$ for
some threshold $T$ (e.g., by repeated merge operations); then reduce($M$) computes
a model $M'$ of size $|M'| \leq T$ for the same block $B$.


Summarizing the statistical analysis and implementing the Merge & Reduce functions 
on statistical models are not trivial undertakings. The approach heavily depends on 
the statistical method employed and on the representation chosen to store the model. 
Here, we present the novel, general concept and discuss how to design the Merge &
Reduce functions for the example of linear regression in the
Bayesian setting.

We first describe how the Merge & Reduce functions interact in a structured way
to perform the statistical analysis task on the data block-by-block while maintaining
a model for the whole subset of data presented so far. The data structure consists of
$L = O(\log(n/n_b)) = O(\log n)$ buckets for a block size $n_b$ that is small enough to fit into the
main memory of the machine. The buckets store one statistical model each. Initially,
they are all empty. One bucket, the working bucket $B_0$, is dedicated to store the model
for the current batch of data, while each of the other buckets $B_i$ stores one model on
its corresponding level $i \in \{1, \ldots, L\}$ of a binary tree structure formed by the merge
operations, see Figure 2.22. The data structure works in the following way. First, we
read one block of data of size $n_b$. We perform the statistical data analysis on this block
only. The model that summarizes the analysis is stored into $B_0$. We begin to propagate
the model in the tree structure from bottom to top by repeatedly executing Merge &
Reduce operations on each level. If $B_1$ is empty, then we just copy the model from $B_0$
to $B_1$ and empty $B_0$. Otherwise, we have two models that are siblings in the tree, so
we merge the two into $B_0$, empty $B_1$ and proceed with $B_2$. Again, if it is empty, the
model from $B_0$ is stored in $B_2$ and the propagation terminates. Otherwise, we have
two siblings that can be merged and propagated to the next higher level in the tree.
In general, the propagation stops as soon as the bucket on the current level is empty.
When this happens, the update of the data structure has completed, and we can move
on to reading and analyzing the next block of input data. This is repeated until the end
of the stream. Notice that except for the additional working bucket, we need to store at
most one bucket on each level at a time, since two siblings are merged immediately.
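
The bucket propagation described above can be sketched as a small data structure. This is an illustrative sketch, not the implementation of Geppert et al. [227]: the class name `MergeReduce` is hypothetical, any reduce step is assumed to happen inside the supplied `merge` function, and the toy model (a count and a mean) merely stands in for a statistical model.

```python
class MergeReduce:
    """Bucket structure for Merge & Reduce over a stream of data blocks.

    fit(block) turns one block into a model; merge(m1, m2) combines two models.
    A reduce step, if needed, is assumed to be part of merge.
    """

    def __init__(self, fit, merge):
        self.fit = fit
        self.merge = merge
        self.buckets = []          # buckets[i - 1] holds the model on level i, or None

    def add_block(self, block):
        working = self.fit(block)              # model in the working bucket B_0
        level = 0
        # Propagate upwards while the bucket on the current level is occupied.
        while level < len(self.buckets) and self.buckets[level] is not None:
            working = self.merge(self.buckets[level], working)
            self.buckets[level] = None
            level += 1
        if level == len(self.buckets):
            self.buckets.append(None)
        self.buckets[level] = working

    def final_model(self):
        """Postprocessing: merge the models left in the buckets at the end of the stream."""
        remaining = [m for m in self.buckets if m is not None]
        if not remaining:
            return None
        result = remaining[0]
        for model in remaining[1:]:
            result = self.merge(model, result)
        return result


# Toy model: (number of observations, mean of the block)
def fit_mean(block):
    return len(block), sum(block) / len(block)

def merge_means(m1, m2):
    n1, mean1 = m1
    n2, mean2 = m2
    n = n1 + n2
    return n, (n1 * mean1 + n2 * mean2) / n

mr = MergeReduce(fit_mean, merge_means)
stream = list(range(1, 13))                 # 12 observations
for start in range(0, len(stream), 2):      # six blocks of size n_b = 2
    mr.add_block(stream[start:start + 2])
n_total, mean_total = mr.final_model()
```

With six blocks, two buckets remain occupied at the end of the stream (covering four and eight observations), and the final merge combines them into one model for all twelve observations.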

Fig. 2.22: Illustration of the Merge & Reduce principle taken from Geppert et al. [227]. The data is
presented in the form of a stream and subdivided into Blocks 1 through 6 of equal size. Models $M_1$
to $M_{11}$ are numbered in order of their creation throughout the execution. Arrows between models
indicate the Merge & Reduce operations. Sibling models are deleted right after their parents' creation.
Thus, only one model is stored on each level, i.e., in buckets $B_1$ to $B_3$, at a time. The working
bucket $B_0$ acts on all levels, eventually holding the final model after postprocessing at the end of the
stream.

A linear regression model is given by

$$Y = X\beta + \varepsilon, \tag{2.15}$$

where $Y \in \mathbb{R}^n$ is the dependent variable and $X \in \mathbb{R}^{n \times d}$ is the design matrix containing
the observations $x_1, \ldots, x_d$ of the independent variables. The error term $\varepsilon$ is assumed
to be unobservable and is usually modeled by a normal $N(0, \sigma_\varepsilon^2)$ distribution; $\beta \in \mathbb{R}^d$
is the unknown parameter vector of regression coefficients that we wish to estimate. In
Bayesian regression, $\beta$ is assumed to be random and follows a distribution. Interest
centers around the posterior distribution of the parameters that can be written as

$$p_{\mathrm{post}}(\beta \mid X, Y) \propto \mathcal{L}(Y \mid X, \beta) \cdot p_{\mathrm{pre}}(\beta). \tag{2.16}$$

Thus the posterior distribution $p_{\mathrm{post}}$ can be seen as the product of the prior distribution
$p_{\mathrm{pre}}$ and the likelihood function $\mathcal{L}$ of the parameters. In many cases, the posterior distribution
cannot be obtained directly by analytical means, but must be approximated by
some sampling approaches. The most popular method is to apply Markov Chain Monte
Carlo (MCMC) random sampling, but this requires a very large number of simulations.
When the sample size is also very large, such sampling simulations are demanding in
terms of computer memory and computing speed.
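
To illustrate why MCMC becomes costly for large $n$, the following sketch evaluates the unnormalized log posterior and runs a plain random-walk Metropolis sampler. The known error standard deviation, the independent normal priors, and the choice of sampler are assumptions of this sketch and are not taken from the study itself.

```python
import numpy as np

def log_posterior(beta, X, Y, sigma_eps, prior_sd):
    """Unnormalized log posterior: Gaussian log-likelihood plus log prior.

    Assumptions: known error standard deviation sigma_eps and independent
    N(0, prior_sd^2) priors on the coefficients.
    """
    resid = Y - X @ beta
    log_lik = -0.5 * np.sum(resid**2) / sigma_eps**2
    log_prior = -0.5 * np.sum(beta**2) / prior_sd**2
    return log_lik + log_prior

def random_walk_metropolis(X, Y, sigma_eps, prior_sd, n_iter, step, rng):
    """Draw n_iter samples of beta with a random-walk Metropolis sampler."""
    beta = np.zeros(X.shape[1])
    current = log_posterior(beta, X, Y, sigma_eps, prior_sd)
    draws = np.empty((n_iter, X.shape[1]))
    for t in range(n_iter):
        proposal = beta + step * rng.normal(size=beta.shape)
        candidate = log_posterior(proposal, X, Y, sigma_eps, prior_sd)
        if np.log(rng.random()) < candidate - current:   # accept with prob. min(1, ratio)
            beta, current = proposal, candidate
        draws[t] = beta
    return draws
```

Every iteration touches all $n$ rows of `X`, so the cost of one chain grows with both the number of iterations and the sample size. Running the sampler on blocks of size $n_b$ instead keeps each evaluation small.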

We will apply the Merge & Reduce method to Bayesian linear regression. Since we
use the MCMC method to approximate the posterior distributions of the parameters
$\beta$, for each estimated parameter $\beta_j$ we collect the mean $\bar{x}_j$, the median $\tilde{x}_{0.5,j}$, the lower
and upper quartiles $\tilde{x}_{0.25,j}$ and $\tilde{x}_{0.75,j}$, the 2.5 % and 97.5 % quantiles $\tilde{x}_{0.025,j}$ and $\tilde{x}_{0.975,j}$, and the
standard deviation $\sigma_j$ of the posterior distributions. Then, the collected statistics can
be summarized as

$$S = (\bar{x}_1, \ldots, \bar{x}_d, \tilde{x}_{p,1}, \ldots, \tilde{x}_{p,d}, \sigma_1, \ldots, \sigma_d), \tag{2.17}$$

where $p \in \{0.025, 0.25, 0.5, 0.75, 0.975\}$. We use a weighted average of the statistics
to merge these statistics vectors, and the weights are dependent on the size of the
sample.
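
The summary vector and its merge operation can be sketched as follows. Weighting by the number of observations behind each model is an assumption of this sketch (the text only states that the weights depend on the sample size), and the function names are hypothetical.

```python
import numpy as np

QUANTILES = (0.025, 0.25, 0.5, 0.75, 0.975)

def summarize(draws):
    """Summary vector S from posterior draws of shape (iterations, d)."""
    means = draws.mean(axis=0)
    # Rows of the quantile array correspond to p, columns to the parameter index j.
    quantiles = np.quantile(draws, QUANTILES, axis=0).ravel()
    sds = draws.std(axis=0, ddof=1)
    return np.concatenate([means, quantiles, sds])

def merge_summaries(S1, n1, S2, n2):
    """Merge two summary vectors by a weighted average; weights are the sample sizes."""
    n = n1 + n2
    return (n1 * S1 + n2 * S2) / n, n
```

For $d$ parameters the summary vector has $7d$ entries, independent of the block size, which is what keeps the memory footprint of each bucket small.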


2.4.3.2 Simulation and Results 

The main parameters used to generate datasets for the simulation study are the number
of observations $n$, the number of variables $d$ and the standard deviation $\sigma_\varepsilon$ of the error
term $\varepsilon$. Different numbers of observations per block $n_b$ are also chosen. The setup of
the simulation study and the parameter values are similar to those in the simulation
study in Geppert et al. [226].

In this study, all assumptions of a linear regression model are met. A varying 
fraction of the variables has an influence (large or small) on the dependent variable, 
while the remainder is not important for the explanation of Y. 

We evaluate the Merge & Reduce approach by calculating the squared Euclidean
distances $e_m^2$ of the statistics in $S$, specified in Equation 2.17, between the original model
and the Merge & Reduce model for all simulation models $m = 1, \ldots, M$. If the Euclidean
distance is close to 0, then the Merge & Reduce approach approximates the results of
the original model accurately. We see in Figure 2.23 that, as the ratio $n_b/d$ increases, the
difference between the medians obtained from the Merge & Reduce approach and from
the original Bayesian model, evaluated by their squared Euclidean distance, is quite
small. The majority of the squared distances is close to 0. For more details on the results
of other summary statistics of the posterior distributions, see Geppert et al. [227].


2.4.3.3 Conclusion and Outlook 

Merge & Reduce is suitable for Bayesian regression models. The goodness of the approximation
depends on the ratio $n_b/d$ of observations per block and variables and the
goodness of fit of the original model. The first condition can easily be controlled by the
data analyst, especially in a setting with large n. The second condition may require 
care when building the model. 

For the implementation of the Merge & Reduce approach in principle, it is only 
necessary to choose an appropriate statistical model and to implement Merge & Reduce 
operations for this specific type of model. However, the design of such operations is 
not trivial in general. In particular, we showed in Geppert et al. [227] how to design 
such operations for the case of Bayesian linear regression, Gaussian mixture models, 
and generalized linear models. Implementing the Merge & Reduce approach for the 
Bayesian Cox model is a next step in future work.

Fig. 2.23: The figure, taken from Geppert et al. [227], shows a scatterplot of the effect of observations
per block per variable $n_b/d$ on squared Euclidean distances $e_m^2$ ($m = 1, \ldots, M$) between
posterior medians for the Bayesian Merge & Reduce approach. The x- and y-axes are drawn on a
logarithmic scale, and observations are drawn as partially transparent points: gray points mean
single observations; black points represent multiple observations at roughly the same location. The
vertical dashed line is at 0.1.


2.4.4 Overall Conclusion 


We are concerned with several issues that arise when dealing with insufficient compu- 
tational resources. In this contribution, we show ways of reducing high-dimensional 
data and simplifying complex models in the context of biomedical applications. 

In the first two sections, we introduce variable selection strategies in two different 
high-dimensional data settings. In the first scenario, we describe a variable importance
measure approach for genetic SNP data, employed prior to model building. In the 
second scenario, we formulate a variable selection strategy for high-dimensional time- 
to-event data integrating several data sources in the context of a Bayesian Cox model. 
In the third section, we use the Merge & Reduce approach for massive data to analyze 
and build statistical models on small batches separately and recombine them in order 
to cope with the problems of limited running time and insufficient memory. 

All approaches in this contribution improve the efficiency of statistical learning 
when facing the limited computational resources of host devices. They reduce the computation
time, energy consumption, and memory usage of learning individual models
without losing the high accuracy of the model. This, in turn, allows us to analyze more
complex models on massive data with an efficient use of computational resources.

Abstract: Survival analysis comprises statistical methods for time-to-event data. The
main prediction tasks include the estimation of the influence of prognostic factors 
for, say, medical treatments, and the modelling and prediction of survival times using 
regression models. In recent years, in molecular medicine, many omics technologies 
have been developed, generating complex high-dimensional genetic data that can be 
used as predictors. 


For such complex tasks, the selection of the best prediction method out of a large set 
of candidates, along with potential feature selection and hyperparameter optimization, 
represents an optimization task under resource constraints. In this section, approaches 
for tackling the model selection problem in survival analysis are presented, specifically 
using Bayesian optimization and addressing feature selection for high-dimensional 
data. 


2.5.1 Introduction 


In medicine, times to events are compared between groups to estimate the effect of prog- 
nostic factors and medical treatments, and regression models are used to model and 
predict survival times of cells, animals, or patients. For two decades, high-dimensional 
genetic and genomic variables have been generated and analyzed as potential predic- 
tive and prognostic factors in biological and medical scenarios. The very large number 
of variables requires developing and using tailored methods to describe the complex 
relationships. Popular modeling approaches are based on penalized regression meth- 
ods, gradient boosting methods, survival trees, and survival forests, often combined 
with suitable feature selection methods. 

In recent years, machine learning approaches were used to find the best survival
method from a large set of candidates. Efficient approaches are required, since it is
crucial that runtimes be kept short, especially in resampling scenarios with many repeated
estimation tasks and in complex high-dimensional predictor settings. In
CRC 876, we applied modern Bayesian optimization (BO) [303] techniques to efficiently
identify the best survival prediction method, by modeling the relationship between
the choice of the survival prediction method (as well as its hyperparameters) and its
performance or quality, using so-called surrogate functions. On several lung cancer
datasets the new approach was superior to established benchmark approaches [367,
369]. After a short introduction into the analysis of time-to-event data in Section 2.5.2,
model selection for survival analysis is discussed in Section 2.5.3. To solve this task, 
various R packages were implemented both for the general candidate selection and for 
parallelization, as presented in Section 2.5.6. 

The same basic idea was used in a scenario where survival predictions for a
specific cancer dataset are to be improved by adding data from similar datasets. This is
a frequent situation in, say, cancer survival analysis, where patient numbers in clinical 
trials are limited due to ethical, financial, and administrative reasons, but similar 
treatments are applied, e.g., in other clinical centers. However, simply adding similar 
datasets to the one of interest potentially deteriorates the predictions, due to structural 
differences between the datasets. Instead, one can estimate dataset-specific weights 
that determine how strongly these datasets should be weighted. In CRC 876, we applied
BO to determine such optimal weights for inclusion of the respective observations 
in appropriate weighted likelihood-based modelling approaches [531], as shown in 
Section 2.5.4 below. In two other projects related to feature selection, we developed 
improved methods based on two-fold subsampling schemes [383] and benchmarked 
filter methods against each other for high-dimensional data [95]. These analyses are 
described in Section 2.5.5. 


2.5.2 Analysis of Time-to-Event Data 


Survival analysis, also called event-time analysis, deals with the analysis of times to 
certain events and is used in many application fields. In medicine, the overall survival 
(OS) of patients is often of direct interest. Alternatively, Progression-Free Survival (PFS) 
is frequently analyzed, which includes Event-Free Survival (EFS) and recurrence-free 
survival. An important property of survival data is that they are often not fully ob- 
servable, such as when patients in a clinical trial have not yet experienced the event 
of interest at the time the trial ends and the data is analyzed. This situation is called 
right-censoring, since for patients without an observed event, the survival time must 
be greater than or equal to the time until the end of the study. Depending on the type of 
missing information, many other censoring mechanisms are defined and considered in 
the analysis techniques. 

Specialized statistical methods for analyzing survival data have been developed
and are widely used in the literature and in practice. Most prominent are the Kaplan-Meier
estimator for estimating survival curves under right-censoring, the log-rank test
for comparing survival between patient groups, and the Cox proportional hazards 
model [149] for estimating survival dependent on a number of explanatory variables, 
such as tumor size or age in oncological studies. 
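
To make the handling of right-censoring concrete, the following minimal sketch computes the Kaplan-Meier estimate for a small made-up sample. The function name `kaplan_meier` and the toy data are assumptions for illustration only; the code is not taken from any of the cited works.

```python
from collections import Counter


def kaplan_meier(times, events):
    """Return [(t, S(t))] at every distinct event time.

    times  -- observed times (event time or censoring time)
    events -- 1 if the event was observed, 0 if right-censored
    """
    deaths = Counter(t for t, d in zip(times, events) if d == 1)
    leaving = Counter(times)      # events and censorings both leave the risk set
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(leaving):
        if deaths[t] > 0:
            # Product-limit step: multiply by the share of patients at risk who survive t.
            survival *= 1.0 - deaths[t] / at_risk
            curve.append((t, survival))
        at_risk -= leaving[t]
    return curve


# Hypothetical toy sample: the patients at t = 3 (one of them) and t = 8 are censored.
toy_times = [2, 3, 3, 5, 8]
toy_events = [1, 1, 0, 1, 0]
for t, s in kaplan_meier(toy_times, toy_events):
    print(f"S({t}) = {s:.2f}")
```

Censored patients never trigger a drop in the curve, but they still count as being at risk until their censoring time, which is how the estimator uses their partial information.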

In regard to the evaluation of performance, seemingly obvious approaches can lead to
wrong interpretations of the results. For example, simply predicting the event indicator,
which indicates whether a patient has survived until the end of the study, neglects the different
time intervals that have passed since the patients entered the study. By contrast, methods
based on hazard rates, which model the instantaneous failure rates at different time
points, can cope with the censoring mechanisms. Alternatively, parametric likelihood
methods that consider the missing information can also be used.

For evaluating the prediction accuracy of survival models, several suitable mea- 
sures have been developed. Concordance statistics, in particular Harrell’s C-index and 
the area under the (time-dependent) ROC curve, are the most popular measures. How- 
ever, they consider only the discrimination ability of a survival model and not the 
calibration. This means that monotone transformations of predicted values of survival 
outcomes do not change the concordance score, which limits the interpretability of the 
score for clinicians. Alternatively, the Brier score is also widely used. It considers both 
calibration and discrimination, but interpretation is also difficult. An advantage is that 
it can be related to a time-specified horizon. A discussion of these important properties 
and an adaptation of the Brier score can be found in Kattan and Gerds [313]. 
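
The statement that concordance ignores calibration can be checked with a small sketch of Harrell's C-index. The function name `harrell_c_index`, the tie handling (ties in the score count as 0.5), and the toy data are assumptions for illustration; production implementations may treat ties in the observed times differently.

```python
import math


def harrell_c_index(times, events, risk_scores):
    """Share of comparable pairs that the risk scores rank concordantly.

    A pair (i, j) is comparable if patient i has the shorter observed time
    and experienced the event. A higher risk score should mean shorter survival.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    return concordant / comparable


# Hypothetical toy data: the third patient is censored.
toy_times = [2, 4, 6, 8]
toy_events = [1, 1, 0, 1]
toy_risk = [0.9, 0.3, 0.5, 0.1]

c_raw = harrell_c_index(toy_times, toy_events, toy_risk)
# A strictly increasing transformation changes the scale but not the ranking.
c_transformed = harrell_c_index(toy_times, toy_events, [math.exp(5 * r) for r in toy_risk])
print(c_raw, c_transformed)
```

Both calls return the same value, because only the order of the predictions enters the score. This is why a well-ranked but badly calibrated model can still obtain a high C-index.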

In preclinical and clinical studies, genetic factors are of interest, and modern high-throughput
technologies provide many thousands of potential explanatory variables. Even
the popular but controversial rule of thumb that the number of events per variable 
should be at least 10 cannot be used as a basis for sample size planning. Instead, 
tailored statistical and machine learning approaches are required. Aspects to consider 
for model selection in this scenario are discussed in the next subsection. 


2.5.3 Model Selection for Survival Analysis 


Model selection in survival analysis, compared with model selection in classical ma- 
chine learning setups such as regression or classification, presents numerous additional 
challenges. 

First, instead of having to solve a learning task with many observations (e.g. pa- 
tients) and comparatively few variables, in survival analysis we often face a low sample 
size problem. Even worse, with the rise of omics technologies, thousands to hundreds 
of thousands of genetic features need to be included in the analysis to be able to iden- 
tify the most important genes. However, most machine learning or statistical learning 
algorithms have been designed and heavily optimized for a large sample size n, and 
usually have worse than quadratic runtime in the number of features p. For this reason 
alone, we often face runtime issues in the n < p scenario. 

Second, a dual objective is often pursued, and the predictive performance of the 
models is not the only target criterion. Instead, it is desired to identify the important 
features (clinical covariates, genetic dispositions, or genes) in the given medical context. 
A good predictive performance often ensures that the model describes the data in a
meaningful way, which is the prerequisite for extracting a set of important features. This
restricts the analysis to models that either come with an embedded feature selection or
models that still work reasonably well after a feature filter has been applied.

Third, all performance measures in survival analysis require a large enough test set to
yield performance values that lead to reliable statements. For example, the popular
C-index mentioned above assesses the ranking of the predictions for survival in the test
set by comparing it with the true, observed ranking of survival times (while correcting
for censoring). Obviously, having too few observations in the set results in a high
variance of the performance estimator. Since usually only a few hundred observations
are available in total per dataset, the number of repetitions must be increased during
resampling in order to account for the larger variance. Of course, this exacerbates the
runtime problems one is already facing.

These points taken together form a hard tuning problem with the following charac- 
teristics: 

1. The models form a black box from the perspective of the tuner, as no derivatives
are known. Therefore, the optimization problem itself is also called a “black-box”
problem.

2. To assess the predictive performance of a hyperparameter configuration θ and its
resulting model, the data needs to be split into a training set and an independent
test set. At the latest, this step introduces stochastic components into the tuning
problem (some learners are non-deterministic anyway).

3. The search space spanned by the hyperparameters to be tuned usually includes
both numerical and categorical variables. This precludes the use of many tuners
derived from purely discrete or purely continuous optimization.

4. Each model fit is potentially resource demanding, in terms of computational time, 
memory requirements, or communication costs. The key word here is “potentially”. 
Some models, e.g., a simple Cox model augmented with an aggressive feature filter, 
can easily be fitted in less than a minute even with n = 200 observations and p = 
10° variables (features). Other learners, such as support vector machines, require 
a complete day for the same task on the same hardware while simultaneously 
consuming several orders of magnitude more memory. Obviously, the resource 
requirements are very heterogeneous, which should be taken into account during 
the mandatory parallelization. 

5. Last but not least, during hyperparameter optimization, we generally have to deal 
with an additional type of censoring (besides the censoring of the survival times): 
It is not unusual that the learner implementations crash from time to time due, 
say, to numerical problems. And since the tuning is usually distributed on larger 
computation sites with shared and contested resources, computational jobs can hit 
a wall time and be killed by a scheduler. In such a case, the missing performance 
score must be imputed with a number to continue with the tuning, and it is unclear 
which value to choose. 
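
The last point can be illustrated with a small wrapper around the objective function. This is only a sketch under stated assumptions: the name `safe_evaluate`, the choice of imputing the worst score observed so far, and the fallback of 0.5 (the C-index of a random ranking) are illustrative choices, not the strategy used in the cited studies. Jobs killed by a scheduler would additionally need to be detected outside the Python process.

```python
def safe_evaluate(objective, config, history, fallback=0.5):
    """Evaluate a configuration; on a crash, impute a pessimistic score.

    objective -- callable returning a score where higher is better (e.g. a C-index)
    history   -- list of (score, succeeded) tuples, extended in place
    fallback  -- assumed score if nothing has succeeded yet
    """
    try:
        score = objective(config)
        succeeded = True
    except Exception:
        # Assumption: impute the worst score of all successful evaluations so far.
        score = min((s for s, ok in history if ok), default=fallback)
        succeeded = False
    history.append((score, succeeded))
    return score


def toy_objective(config):
    """Hypothetical learner that crashes for one configuration."""
    if config == "unstable":
        raise RuntimeError("numerical problem")
    return {"cox": 0.70, "forest": 0.62}[config]


log = []
for cfg in ["cox", "forest", "unstable"]:
    print(cfg, safe_evaluate(toy_objective, cfg, log))
```

Keeping the success flag next to each score makes it possible to revisit the imputation rule later without rerunning the expensive evaluations.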


Over the last decade, special strategies addressing the difficulties of hyperparameter 
optimization have emerged. An overview is given by Bischl et al. [68]. Roughly speaking, 
hyperparameter optimization is about finding the configuration θ of a model that
leads to the best predictive performance (evaluated on an independent test set). If
the evaluation of a single configuration is sufficiently expensive with regard to com- 
putational resources like runtime, every evaluation counts, which also means that 
rather wasteful optimization methods are not applicable. This applies, among others, 
to Evolutionary Algorithms (EAs). EAs usually require many hundreds of configurations 
before being able to make the first targeted decisions. Instead, a tuner that optimizes 
more aggressively from the start is needed. 

One tuner that addresses all the problems of the outlined expensive black-box 
optimization problem is iterated F-racing [421]. The basic idea of F-racing is to race a 
population of configurations against each other, and to eliminate in each iteration can- 
didates that are underperforming based on a Friedman test. Iterated F-racing extends 
this approach by assuming a probability distribution over the search space. This distri- 
bution gets updated iteratively so as to be centered around some elite configurations. 
We applied this tuning approach to a broad range of survival pipelines (consisting of 
the feature filter and the survival model) [369]. The benchmark considers 12 different 
datasets of four breast cancer cohorts where each dataset consists of clinical and/or 
genetic variables (features). The architecture of the pipeline, i.e. the choice of filter 
and the choice of model, is encoded as virtual hyperparameters passed to the tuner. 
This way, dominated combinations of filters and models receive fewer evaluations,
giving the tuner more opportunity to exploit hyperparameters of more promising
combinations. As a baseline, four reasonable approaches that are popular in practice and
arguably less computationally intensive have been evaluated. To the best of our
knowledge, this was the largest benchmark of survival models up to that point. In
comparison with the baseline approaches, the tuning yields significantly better results
in terms of the C-index. The caveat is the effort required to achieve these results: with
more than 10 000 hyperparameter evaluations, the tuning cannot be applied easily to new data
or cohorts.

Another tuner that fits the requirements of hyperparameter optimization
in survival analysis well is Model-Based Optimization (MBO). Its performance has been
verified by Lang [367], who extended the benchmark study of Lang, Kotthaus, Marwedel,
Weihs, Rahnenführer, and Bischl [369] with more datasets, more
filters, more models, and a larger time budget.

Figure 2.24 visualizes the survival probability in the included cohorts. Although the 
studies are all on lung cancer, and share the same set of clinical and genetic features, 
they differ considerably with respect to survival times. This is a frequently observed 
characteristic, and makes the careless merging of the datasets into a larger dataset with 
more observations inappropriate. As a result, in this domain, it is usually not possible 
to configure a single model to perform sufficiently well on all cohorts. Instead, for each 
cohort, tuning starts from zero. One goal of the analysis was to thin out the portfolio 
of methods to consider for a new tuning run. If, e.g., only two pipelines consisting 
of a filter and a model have to be tried, the computational effort required for tuning 
is significantly reduced, making tuning for new cohorts feasible on a single computer
and therefore applicable in practice. This has been systematically analyzed
by Lang [367].

Fig. 2.24: Plot of Kaplan-Meier estimators including confidence bounds for the survival function S(t),
stratified by the cohorts that are included in [367]. In the Kaplan-Meier plots, the time t is plotted
against the estimated proportion of patients still alive at t. Lines represent survival curves of the
seven cohorts. A vertical drop in a curve indicates an event, and a plus on a curve means that an
observation was censored at this time.

Parts of the results are summarized in Figure 2.25. Additionally, the mean
ranks of filters and learners have been analyzed and revealed the following important
take-home messages in the context of the datasets analyzed:


- If one base learner has to be chosen, random survival forests perform best on
  average.

- One of the most popular approaches due to its embedded feature selection, fitting
  a Cox proportional hazards model with a LASSO penalty ($L_1$), performs the worst
  on average.

- Tuning over multiple base algorithms jointly with MBO results in the best
  performance on average.

- Tuning each pipeline individually and picking the best performing pipeline
  (approach BenchOpt in Figure 2.25) in a second step is not only a waste of
  computational resources; it also leads to overoptimistic performance estimates. As each
  tuning run is stochastic, and the pipelines often perform comparably well, picking
  the best configuration is determined by the stochastic noise to some degree. This
  result is particularly alarming, as the described manual benchmarking is
  common practice.

Fig. 2.25: Resulting average C-index on independent test sets of multiple algorithms, stratified
by cohort. Each base algorithm is tuned individually, jointly with the choice of filter and
the filter's hyperparameters. MBO tunes over all filters and algorithms simultaneously. BenchOpt
expresses the resulting performance on an independent test set after picking the best performing
base algorithm on the training data [367].


The heterogeneous runtimes (or, more generally, the heterogeneous resource demands)
have been addressed by Richter, Kotthaus, Bischl, Marwedel, Rahnenführer, and Lang
[530] and Kotthaus et al. [346]. Instead of fitting only a single surrogate model that guides
the optimization to areas with the best predictive performance, multiple surrogate
models are fitted in each iteration. On the one hand, the usual surrogate based on the
observed predictive performance is fitted. On the other hand, one or more surrogate
models for computational resources are fitted, e.g., one surrogate for the runtime and
one surrogate describing the memory consumption. As a result, we can query the
models for the estimated predictive performance and the estimated resource demands
for all hyperparameter configurations. All the information is fed into a scheduler that
selects a subset of the configurations and maps them to multiple CPUs or workers
based on their priority (as derived from the estimated predictive performance) while
minimizing the idle times (based on the estimated runtime).
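
The interplay of priority and estimated runtime can be illustrated with a deliberately simple greedy rule: process configurations in order of priority and always hand the next one to the worker that becomes free first. This is a simplified stand-in, not the scheduling algorithm of [530] or [346]; the function name `schedule`, the tuple layout, and the toy numbers are assumptions.

```python
import heapq


def schedule(configs, n_workers):
    """Greedy mapping of configurations to workers.

    configs   -- list of (name, priority, estimated_runtime)
    n_workers -- number of parallel workers
    Returns the plan {worker_id: [names]} and the estimated makespan.
    """
    free_at = [(0.0, w) for w in range(n_workers)]   # (time the worker is free, id)
    heapq.heapify(free_at)
    plan = {w: [] for w in range(n_workers)}
    # Highest priority first; each job goes to the worker that is free earliest.
    for name, _priority, runtime in sorted(configs, key=lambda c: -c[1]):
        busy_until, worker = heapq.heappop(free_at)
        plan[worker].append(name)
        heapq.heappush(free_at, (busy_until + runtime, worker))
    makespan = max(t for t, _ in free_at)
    return plan, makespan


# Hypothetical surrogate estimates: (name, priority, runtime in minutes).
candidates = [("a", 0.9, 60), ("b", 0.8, 10), ("c", 0.7, 10), ("d", 0.6, 40)]
plan, makespan = schedule(candidates, n_workers=2)
print(plan, makespan)
```

In this toy case the long-running configuration occupies one worker while the three shorter ones are packed onto the other, so neither worker idles before the batch ends.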


2.5.4 Weighted Subgroup Selection for Survival Analysis


Obtaining a reliable prediction model for a specific cancer subgroup or cohort s* is often 
difficult due to a limited sample size and, in survival analysis, due to potentially high 
censoring rates. Sometimes similar data from other patient subgroups is available, e.g., 
from other clinical centers. Simple pooling of all subgroups can decrease the variance
of the estimated parameters of the prediction models, but it can also increase the bias due to
heterogeneity between the cohorts.

Different approaches exist to improve the predictive quality by including data 
from other patient subgroups in a weighted fashion. One possible way is to include 
one further weighted subgroup, as proposed by Weyer and Binder [726]. Alternatively, 
individual weights for each patient can be estimated from the training data, as described 
by Bickel et al. [63]. The idea is that weights match the joint distribution of the combined 
data to the distribution in each subgroup, such that a patient who is likely to belong to 
the target subgroup receives a higher weight in the subgroup-specific model. Weights 
correspond to the conditional probability of belonging to the target subgroup s* given 
the observed covariates and outcome divided by the prior probability for s*. The former 
is estimated from the training data by multi-class classification, and the latter by the 
relative frequency of s*. 

The goal is to optimize the predictive performance of our model for the target
subgroup s*. Including data from additional subgroups in the training data should
increase the predictive performance for the target subgroup. Accordingly, this forms a
combinatorial optimization problem where additional subgroups must be chosen to 
maximize the predictive performance. 

However, completely abstaining from using the data of certain subgroups seems overly
drastic, since each additional subgroup might contain relevant information.
Fortunately, most machine learning methods, including those that can be used for
time-to-event data, allow observation weights. This allows us to give a low weight to
observations that do not represent our problem. However, finding an optimal weight
for each observation is exceedingly complex. Instead, we introduce subgroup weights 
as presented by Richter et al. [531]. The observation weight is then determined by 
the subgroup membership of each observation. This enables the inclusion of certain 
subgroups with a specific weight. Hence, including additional data in a weighted way 
might lead to a better solution than the binary choice of including a subgroup with full 
weight or not at all. 

By introducing subgroup weights, we relaxed the combinatorial problem into a nu- 
merical optimization problem. However, setting those subgroup weights in an optimal 
way remains a difficult optimization problem. First, each additional subgroup leads to 
a further weight parameter that has to be chosen optimally. Second, the evaluation of 
a weight parameter combination can take fairly long, since the datasets themselves 
tend to be rather large, especially when they include high-dimensional genetic measurements
in the scenario of survival analysis. In this case, it becomes infeasible to try
out many weight parameter combinations in order to find an optimal one. 

Therefore, we can apply state-of-the-art optimization methods for expensive black-box
problems, such as MBO, in order to find the optimal
subgroup weights without the cost of having to evaluate hundreds of different weight
parameter combinations. For our evaluation, we optimize the subgroup-specific weights
$w^{(g)}$ in the weighted Cox model. Note that Section 2.5.3 reported a study in which
random survival forests performed better on average than a Cox proportional
hazards model with a LASSO penalty. Here, however, we use the much more frequently
used penalized regression approach to evaluate the potential improvement due to the
weighted analysis.


Weighted Cox Model Assume the observed data of patient $i$ consists of the tuple
$(t_i, \delta_i)$, the covariate vector $x_i = (x_{i1}, \ldots, x_{ip})^\top \in \mathbb{R}^p$, and the subgroup membership
$s_i \in \{1, \ldots, S\}$, with $S$ being the total number of available subgroups and $i = 1, \ldots, n$.
Here, $t_i$ denotes the observed time of patient $i$, i.e., the minimum of the event time and the
censoring time, and $\delta_i$ indicates whether the patient experienced an event ($\delta_i = 1$) or was
(right-)censored ($\delta_i = 0$). As mentioned above, one of the most popular regression
models in survival analysis is the Cox proportional hazards model [149]. It models the
hazard rate $h(t \mid x_i)$ of a patient at time $t$ as

$$
h(t \mid x_i) = h_0(t) \cdot \exp(\beta^\top x_i) = h_0(t) \cdot \exp\Big( \sum_{j=1}^{p} \beta_j x_{ij} \Big),
$$

where $h_0(t)$ is the baseline hazard rate, and $\beta = (\beta_1, \ldots, \beta_p)^\top$ is the unknown parameter
vector. The parameters are estimated by maximizing the partial log-likelihood [326,
Chapter 8.3]. A version of the partial log-likelihood that uses observation weights is
introduced in [726]:

$$
l(\beta) = \sum_{i=1}^{n} \delta_i w_i \Big( \beta^\top x_i - \ln \sum_{k=1}^{n} \mathbb{1}(t_k \ge t_i) \, w_k \exp(\beta^\top x_k) \Big). \tag{2.18}
$$

Instead of an individual weight for each patient, we introduce an individual weight
for each subgroup. Therefore, we assign the same weight to each patient of the same
subgroup:

$$
w_i = \begin{cases} 1, & \text{if } s_i = s^*, \\ w^{(g)}, & \text{if } s_i = g, \; g \in \{1, \ldots, S\} \setminus \{s^*\}, \end{cases} \tag{2.19}
$$


where $w^{(g)} \in [0, 1]$ is the specific weight for the subgroup $g$, and $s^*$ is the subgroup for
which we want to obtain predictions. Patients from subgroup $s^*$ enter the prediction model
with full weight 1.
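
The two formulas can be read as a short computation: Eq. (2.19) turns subgroup labels into observation weights, and Eq. (2.18) evaluates the weighted partial log-likelihood for a given parameter vector. The following sketch uses assumed names (`subgroup_weights`, `weighted_partial_loglik`) and toy data; it evaluates the objective only and does not perform the maximization.

```python
import math


def subgroup_weights(subgroups, target, w):
    """Eq. (2.19): weight 1 for the target subgroup, w[g] for subgroup g."""
    return [1.0 if s == target else w[s] for s in subgroups]


def weighted_partial_loglik(beta, times, events, X, weights):
    """Eq. (2.18): weighted partial log-likelihood of the Cox model."""
    eta = [sum(b * x for b, x in zip(beta, row)) for row in X]   # beta^T x_i
    n = len(times)
    total = 0.0
    for i in range(n):
        # Censored or zero-weight observations contribute nothing to the outer sum.
        if events[i] == 0 or weights[i] == 0.0:
            continue
        risk_set = sum(weights[k] * math.exp(eta[k])
                       for k in range(n) if times[k] >= times[i])
        total += weights[i] * (eta[i] - math.log(risk_set))
    return total


# Hypothetical toy data with one covariate and two subgroups, "A" being the target.
groups = ["A", "A", "B"]
times = [1.0, 2.0, 3.0]
events = [1, 1, 1]
X = [[0.5], [-0.2], [1.0]]

for w_b in (0.0, 0.5, 1.0):
    weights = subgroup_weights(groups, target="A", w={"B": w_b})
    print(w_b, weighted_partial_loglik([0.3], times, events, X, weights))
```

Setting the weight of subgroup "B" to 0 reproduces the standard subgroup analysis, and setting it to 1 reproduces the pooled model, so the weight interpolates between the two extremes.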




Standard subgroup analysis is based only on the patients in the subgroup of interest
(target subgroup $s^*$), which corresponds to $w_i = 0$ for all patients not belonging to $s^*$. A
combined model that pools patients from all subgroups corresponds to $w_i = 1$ for all
patients.

In high-dimensional settings where the number of covariates $p$ is typically much
larger than the sample size $n$, standard maximum likelihood cannot be used for parameter
estimation, since it does not result in a unique solution. Therefore, we add a
LASSO penalty [677] to the partial log-likelihood. LASSO regression performs feature
selection and yields a sparse model solution. The resulting maximization problem of
the penalized partial log-likelihood is given by

$$
\hat{\beta} = \operatorname*{argmax}_{\beta} \Big( l(\beta) - \lambda \sum_{j=1}^{p} |\beta_j| \Big).
$$

The LASSO penalization parameter $\lambda$ is optimized through an internal 10-fold
cross-validation.


Evaluation We are interested in maximizing the predictive performance for a target
subgroup $s^*$. The predictive performance of the weighted Cox model is evaluated using
the C-index. For the evaluation of the model for a given target subgroup $s^*$, a dataset
that contains $S$ subgroups, and a subgroup weight vector $w = (w^{(1)}, \ldots, w^{(S-1)})$, we
conduct a modified 10-fold cross-validation. The validation data should only contain
the target subgroup, because we are only interested in the predictive performance on the
target subgroup. In order to obtain the 10 necessary splits for the cross-validation, we
only divide the data of the target subgroup into 10 chunks. To obtain the prediction for
one chunk, all remaining 9 chunks plus all observations from the additional subgroups
are combined into the training dataset. By performing the modified cross-validation,
we obtain an estimate of the C-index for the given combination of dataset, target
subgroup, and subgroup weight vector.
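
The modified cross-validation described above can be sketched as an index generator. The function name `modified_cv_splits` is an assumption, and the folds are formed deterministically for brevity; in practice the target-subgroup indices would be shuffled first.

```python
def modified_cv_splits(subgroups, target, k=10):
    """Yield (train_idx, test_idx) pairs for the modified k-fold cross-validation.

    Only observations of the target subgroup are divided into k chunks.
    Each training set consists of the other k - 1 chunks plus all
    observations from the additional subgroups.
    """
    target_idx = [i for i, s in enumerate(subgroups) if s == target]
    other_idx = [i for i, s in enumerate(subgroups) if s != target]
    folds = [target_idx[f::k] for f in range(k)]
    for f in range(k):
        test = folds[f]
        train = [i for g in range(k) if g != f for i in folds[g]] + other_idx
        yield sorted(train), test


# Hypothetical data: 20 patients in target subgroup "A", 5 in subgroup "B".
labels = ["A"] * 20 + ["B"] * 5
for train, test in modified_cv_splits(labels, target="A"):
    print(len(train), test)
```

Every test chunk contains target-subgroup patients only, while the patients of the additional subgroups appear in every training set, where they later enter with their subgroup weight.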

Now, the goal is to find the subgroup weight vector that maximizes the C-index.
This optimization problem can be solved with MBO, with a search space $[0, 1]^{S-1}$ that
directly maps to the weight vector. The acquisition function that selects the next weight
vector to be evaluated should take into consideration that results are not deterministic.
Therefore, we proposed the augmented expected improvement [288], which is well
suited for such scenarios. For the Gaussian process regression within the Bayesian
optimization, we proposed the Matérn 3/2 kernel with an estimated nugget effect to
account for the noisy response of our objective.
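
As an illustration of the surrogate's covariance structure, the following sketch builds a Matérn 3/2 covariance matrix over a few weight vectors and adds a nugget to the diagonal. The function names, the fixed length scale, and the fixed nugget value are assumptions; in the approach described above the nugget is estimated from the data rather than set by hand.

```python
import math


def matern32(x, y, length_scale=1.0, variance=1.0):
    """Matérn 3/2 kernel: variance * (1 + sqrt(3) r) * exp(-sqrt(3) r), r = |x - y| / length_scale."""
    r = math.dist(x, y) / length_scale
    return variance * (1.0 + math.sqrt(3.0) * r) * math.exp(-math.sqrt(3.0) * r)


def covariance_matrix(points, nugget=1e-2, length_scale=1.0, variance=1.0):
    """Covariance of noisy observations: kernel matrix plus nugget on the diagonal."""
    n = len(points)
    return [[matern32(points[i], points[j], length_scale, variance)
             + (nugget if i == j else 0.0)
             for j in range(n)]
            for i in range(n)]


# Hypothetical subgroup weight vectors from the search space [0, 1]^2.
weight_vectors = [(0.0, 0.0), (0.5, 0.5), (1.0, 0.2)]
for row in covariance_matrix(weight_vectors, nugget=0.01, length_scale=0.5):
    print([round(v, 3) for v in row])
```

The nugget means that repeated evaluations of the same weight vector are not forced to agree exactly, which matches an objective that is estimated by resampling.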

The benefit of optimizing the subgroup weights is twofold: First, the resulting 
optimal subgroup weight vector does not only maximize the C-index for the target 
subgroup; it also allows drawing conclusions about the similarity of certain subgroups. 
If a certain subgroup weight is small, it can be assumed that this subgroup does not 
have a similar relationship between the explanatory variables and the event times as 
the target subgroup. Second, as shown by Richter et al. [531], the predictive performance
of the method does not deteriorate if additional subgroups are included that contain 
inconsistent data. 

Further benefits could arise from penalizing the weights, which would allow
researchers to completely exclude data with weights close to zero. Then the model
becomes computationally cheaper and possibly more stable. Finally, the presented
approach can be used for any situation where data is pooled from different cohorts and
a machine learning method is used that supports observation weights.


2.5.5 Feature Selection for High Dimensional Data 


The problem of feature selection is particularly important in the domain of high-dimensional
data, as already described in detail in Section 2.5.3. One challenging problem
in this context is the stability of feature selection. Some learners can be restricted to
using only a small subset of the thousands of available features, and learners can be
combined with a feature filter to achieve the same in a generic fashion. However, the set
of selected features is often highly variable. For example, if a Cox proportional hazards
model is extended with an $L_1$ penalty whose parameter $\lambda$ is tuned to include at most
20 features in a 3-fold cross-validation, the resulting three sets of selected features can be pairwise
disjoint. This has a simple reason: if two features $x_1$ and $x_2$ are highly correlated, they
are also comparably good predictors. If the model has to choose between $x_1$ and $x_2$, a
few observations can tip the scales in one direction or the other. If the dataset is now
resampled and these observations are removed from the training set, the scales can
easily tip in the other direction. This is particularly problematic because in this way
no features can be reliably selected for a later analysis, such as a biological analysis.

Lee et al. [383] tackle this problem in two ways: First, a special extension to the 
LASSO regression is used. The preconditioned LASSO [495] is a two-step procedure 
designed to address the problems of high bias in LASSO estimates. Second, the pre- 
conditioned LASSO is embedded in a two-fold subsampling procedure to improve the 
stability of the feature selection via model averaging and extra shrinking of covariates 
based on the selection probability in the inner subsampling. 

The approach has been applied to datasets on neuroblastoma, lung adenocar- 
cinoma, and breast cancer. Both predictive performance (measured by the C-index) 
and stability (measured by the Jaccard index and the Kuncheva index) are improved. 
However, the comparison with popular univariate selection methods does not provide 
a clear picture. 
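
Selection stability of the kind measured here can be quantified with the mean pairwise Jaccard index of the feature sets selected on different resampling iterations. The sketch below uses assumed function names and made-up gene labels; the Kuncheva index, which additionally corrects for chance overlap, is not shown.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two feature sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0          # convention: two empty selections agree perfectly
    return len(a & b) / len(a | b)


def selection_stability(feature_sets):
    """Mean Jaccard index over all pairs of selected feature sets."""
    pairs = list(combinations(feature_sets, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical selections from a 3-fold cross-validation.
folds = [{"g1", "g2", "g3"}, {"g2", "g3", "g4"}, {"g5", "g6", "g7"}]
print(selection_stability(folds))
```

A value of 1 means that every resampling iteration selected the same features, while pairwise disjoint sets, as in the LASSO example above, yield a value of 0.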

Another take on this topic was presented by Bommert et al. [95], where more 
than 20 filter methods are benchmarked against each other for high-dimensional data. 
Although this work is based on classification data, there is no reason why the core 
results should not be transferable to survival problems, and confirming this is currently 
work in progress. One key result is shown in Figure 2.26. There are clear groups of
feature filters. Filters from the same group are expected to give very similar
results across different datasets. Therefore, instead of including all of the more than 20 filters
in the machine learning pipeline, it is sufficient to use a thinned-out portfolio with a smaller
set of filters. Additionally, the filters have been analyzed regarding performance and stability to
provide general recommendations for feature filtering in high-dimensional settings.

Fig. 2.26: Rank correlations between the feature selection order, for all pairs of a large set of filter
methods, averaged across several datasets by the arithmetic mean. The filter methods are ordered
by average linkage hierarchical clustering using the mean rank correlation as a similarity measure
[95].
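
The similarity between two filters that underlies Figure 2.26 is a rank correlation between the orders in which they select features. A minimal sketch of Spearman's rank correlation follows; the function names and the toy scores are assumptions, and the simple formula used here requires that no two features receive the same score from a filter.

```python
def ranks(scores):
    """Rank features 1..n by descending filter score (no ties assumed)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    result = [0] * len(scores)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result


def spearman(scores_a, scores_b):
    """Spearman rank correlation: 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    ra, rb = ranks(scores_a), ranks(scores_b)
    n = len(ra)
    d2 = sum((a - b) ** 2 for a, b in zip(ra, rb))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


# Hypothetical scores that three filters assign to the same four features.
filter_a = [0.9, 0.7, 0.4, 0.1]
filter_b = [5.0, 4.0, 1.0, 2.0]   # nearly the same order as filter_a
filter_c = [0.1, 0.2, 0.3, 0.4]   # reversed order
print(spearman(filter_a, filter_b), spearman(filter_a, filter_c))
```

Filters whose pairwise correlation is close to 1 order the features almost identically, so keeping one representative of such a group loses little.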


2.5.6 Software 


Many machine learning frameworks exist that can be conveniently employed for model 
selection or feature selection. However, most of these frameworks have a strong focus 
on classification and regression. Support for survival analysis is often nonexistent or
insufficient. For proper evaluation, two frameworks have been extended to support
time-to-event data. 

First, the R package mlr [69] has been extended with an object for survival tasks,
including the most common survival learners and survival measures. By building upon
the existing infrastructure for resampling and tuning, survival learners can be tuned
with state-of-the-art tuners such as model-based optimization. For larger tasks, i.e. tasks
with thousands of features of genetic data, parallelization of the benchmark experiments
is mandatory. The package BatchJobs [67] and its successor batchtools [368]
provide the bridge between mlr and managed high-performance computing clusters,
making it possible to compute comprehensive benchmarks on hundreds of CPUs simultaneously.
In this way, users can define and execute exhaustive benchmark studies, such as those
from Lang, Kotthaus, Marwedel, Weihs, Rahnenführer, and Bischl [369], Lang [367], and
Richter et al. [531].

The second framework extended for survival analysis is mlr3 [370], the successor of
mlr. The extension package mlr3proba [640] provides a general framework for probabilistic
regression. Compared with the survival capabilities of mlr, mlr3proba connects
many more learners and, even more importantly, connects and implements many more
survival measures. Additionally, mlr3proba can be embedded in the infrastructure of
the mlr3pipelines [65] package, which provides a language to build complex machine 
learning workflows as directed acyclic graphs. mlr3pipelines is also used to convert 
and unify the many predict types of survival models: while some models return a linear 
prediction vector, others return a continuous ranking, relative risks, or a complete 
time-dependent distribution such as individual survival function estimates. mlr3proba 
provides several pipeline operators for converting between predict types or even for 
composing multiple types. 

Thanks to the unified interface of mlr3 and mlr3proba, it is directly possible to 
use state-of-the-art methods to optimize the hyperparameters of the survival meth- 
ods via mlr3tuning. Especially in the survival context, data preprocessing is often 
a crucial step. Decisions on how to configure the preprocessing should be included 
in this optimization process to obtain an unbiased estimate of the predictive perfor- 
mance. Modeling preprocessing through mlr3pipelines allows the building of a whole 
pipeline that can be resampled and optimized. To obtain an unbiased estimate of the 
performance of a pipeline identified through optimization, the whole optimization can 
be resampled, resulting in a nested resampling setting. Multi-criteria optimization is 
also supported, e.g. to tune for predictive performance, sparsity, and feature selection 
stability simultaneously by connecting the stabm [94] package. 
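
The nested resampling setting can be sketched independently of any framework. The following Python outline is not the mlr3 interface; the names `k_folds`, `nested_resampling`, and the callback `fit_and_score(config, train, test)` are hypothetical placeholders for fitting a whole pipeline on the training indices and scoring it on the test indices.

```python
def k_folds(indices, k):
    """Yield (train, test) index lists for a deterministic k-fold split."""
    folds = [indices[f::k] for f in range(k)]
    for f in range(k):
        test = folds[f]
        train = [i for g in range(k) if g != f for i in folds[g]]
        yield train, test


def nested_resampling(indices, candidates, fit_and_score, outer_k=3, inner_k=2):
    """Tune on inner folds only, then score the chosen configuration on the outer test fold."""
    outer_scores = []
    for outer_train, outer_test in k_folds(indices, outer_k):

        def inner_score(config):
            # The outer test fold is never touched during tuning.
            scores = [fit_and_score(config, tr, te)
                      for tr, te in k_folds(outer_train, inner_k)]
            return sum(scores) / len(scores)

        best = max(candidates, key=inner_score)
        outer_scores.append(fit_and_score(best, outer_train, outer_test))
    return sum(outer_scores) / len(outer_scores)


def toy_fit_and_score(config, train, test):
    """Hypothetical stand-in for a pipeline: the score peaks at config 0.3."""
    return 1.0 - abs(config - 0.3)


print(nested_resampling(list(range(12)), [0.1, 0.3, 0.9], toy_fit_and_score))
```

Because the configuration is selected without access to the outer test fold, the averaged outer score is not inflated by the selection step, which is the bias discussed for manual benchmarking in Section 2.5.3.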


2.5.7 Conclusion 


The analysis of survival data requires the use of adequate statistical methodology, 
especially when it comes to accounting for missing information due to censoring. 
Corresponding methods are available and established. However, for modern high- 
dimensional data increasingly being generated today, omics data in particular, addi- 
tional challenges emerge. Estimating prediction models often requires elaborate feature 
selection and hyperparameter optimization. For this task, Bayesian optimization methods
provide a beneficial solution. They can efficiently identify models with competitive
prediction accuracy out of a large set of candidate models. Of great importance is the
availability and use of software frameworks for reproducible analysis pipelines. One
valuable example is the widely used R package mlr3, which provides efficient, object-oriented
programming on the building blocks of machine learning, together with its
extension package mlr3proba, which provides a general framework for probabilistic
regression, including many popular survival models and survival measures.


Abstract: Proteins have manifold functions in living cells, including structural integrity,
transport, defense against pathogens, or message transmission, to name but a few. 
Recent advances in Machine Learning appear to have solved the protein folding problem, 
i.e., how to obtain the three-dimensional functional protein structure from the amino 
acid sequence of the protein. However, proteins rarely act alone, but instead perform 
their functions together with other proteins in so-called protein complexes. Quantifying 
the similarity between two protein complexes is essential for numerous applications, 
e.g., for database searches of complexes that are similar to a given input complex. While 
similarity measures have been extensively studied on single proteins and on protein 
families, there is as yet little work on modeling and computing the similarity between protein
complexes. Because protein complexes can be naturally modeled as graphs, graph
similarity measures may be used, but these are often computationally hard to obtain
and do not take typical properties of protein complexes into account. We introduce
a parametric family of similarity measures based on Weisfeiler-Leman labeling (see
"The Weisfeiler-Leman Algorithm for Machine Learning with Graphs" in Section 4.2 in
Volume 1). Based on simulated complexes, we show that the defined family of similarity
measures is in good agreement with edit similarity, a similarity measure derived from 
graph edit distance, though it can be computed more efficiently. Moreover, in contrast to 
graph edit similarity, the proposed measures allow for an efficient similarity search in 
large volumes of protein complex data. They can therefore be used as a basis for large-scale
machine learning applications. 


2.6.1 Introduction 


Proteins fulfill manifold tasks in living cells, but they rarely act alone. Indeed, most 
cellular functions are enabled only when proteins physically interact with other pro- 
teins, forming protein complexes. DNA transcription is a typical example, where RNA 
polymerase II, general transcription factors, cell type specific transcription regulators, 
and mediator proteins interact.

Understanding protein complex formation and function is one of the big challenges of
cell biology, and is approached by both experimental techniques and computational 
modeling. While the constituent protein sequences can be obtained from the genome, 
the computational prediction of real protein complexes from protein interaction net- 
works appears to be much more difficult [61, 643], and we presently face a lack of 
experimental datasets on verified real protein complexes. Fortunately, new experimental
technologies such as high-resolution protein-protein docking are about to enhance
our understanding of complexes significantly in the near future [348, 490].

When studying biological entities such as protein sequences or protein complexes,
a fundamental task is to define a measure of similarity between two such entities. For
protein sequences, there is a well-established theory based on scoring matrices and 
alignment scores [496]. For protein complexes, it appears that no systematic effort to 
quantify similarity has been made yet. The purpose of the present article is therefore to 
discuss the different options for defining a similarity measure on protein complexes and
to propose a reasonable and computationally tractable definition of protein complex
similarity. Establishing a similarity measure is not only of fundamental importance;
it also has many immediate applications.

Database search In the database search problem we are given a query complex and a
large collection (database) of complexes, and the task is to find the complexes in 
the database whose similarity to the query exceeds a given threshold. 

Comparing predictions Several complex prediction methods predict putative com- 
plexes by locating dense regions in a protein interaction network [180, 272, 424, 
497], and for comparing complexes predicted by different algorithms, it is of interest 
to compute a maximum-weight matching between the output of two algorithms, 
where the weighting is given by a similarity function. 

Summarizing and clustering When simulating complex formation based on avail- 
able knowledge such as possible interactions and interaction constraints, it is 
helpful to aggregate the simulation output to focus on frequently seen or typi- 
cal complexes, ignoring small differences. Aggregation or clustering by similarity 
thereby reduces data size and complexity. Such a task requires quantifying the 
similarity between two protein complexes. 


When there are tens of thousands of different complexes subject to pairwise comparison, 
a similarity measure must be efficiently computable. 

This contribution is based on a conference paper by the authors [644], adapted by 
permission from Springer Nature; Copyright © 2019 Springer Nature Switzerland AG. 


2.6.1.1 Models for Protein Complexes 
We first discuss models for protein complexes at different levels of detail, namely the 
set, multiset, and graph models.

While intuition suggests that protein complexes can be naturally described as graphs
with proteins as vertices and physical interactions as edges, there are in fact different
ways to formally describe a protein complex. We start with a given set $P$ of all proteins
of an organism, the building blocks of the complexes.

Set In its simplest form, a protein complex can be defined as a set (in the mathematical
sense, i.e., without multiplicities) of proteins, i.e., as a subset $\{p_1, p_2, \dots, p_n\}$
of $P$. Sets capture neither the multiplicities nor the nature of the physical interactions
between the constituent proteins of a complex. Some experimental techniques
only give set-type information, and several existing databases only provide this
type of information, e.g., the CORUM database [551].

Multiset Formally, a multiset is a function $C : P \to \mathbb{N}_0$ that assigns a multiplicity to
each protein $p \in P$ with $C(p) = 0$ for proteins $p$ that are not part of the complex.
We also use the multiset notation $C = \{\!\{p_1, p_1, p_2\}\!\}$ to express that $C(p_1) = 2$,
$C(p_2) = 1$ and $C(p) = 0$ for all other $p \in P$. Defining a protein complex as a multiset
of proteins gives a more accurate representation of the complex, but still does not
consider the interaction topology.

Graph To add more information, we can define a protein complex as an undirected
graph $C = (V, E, \ell)$ with labeled vertices $V$, such that each vertex $v \in V$ represents
a protein and hence has a label $\ell(v) \in P$, and each edge $e \in E \subseteq V \times V$ represents a
physical interaction between the corresponding proteins; $E$ is symmetric
and $C$ is connected. The graph description provides the interaction topology. We
call this representation a protein complex graph and define its size as $|C| := |V| + |E|$.
In the following, we use the terms protein complex and protein complex graph
interchangeably.
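
The three models can be made concrete in a few lines of Python. This is a minimal sketch for illustration only; the toy complex, the variable names, and the helper `complex_size` are assumptions and not part of the original work.

```python
from collections import Counter

# Hypothetical toy complex: vertex ids mapped to protein names (labels),
# plus undirected edges between vertex ids.
labels = {0: "CDC42", 1: "CDC42", 2: "PAK4"}
edges = {frozenset((0, 1)), frozenset((0, 2))}

# Set model: which proteins occur (no multiplicities, no topology).
as_set = set(labels.values())

# Multiset model: protein -> multiplicity; absent proteins count as 0.
as_multiset = Counter(labels.values())


def complex_size(labels, edges):
    """Size |C| := |V| + |E| of a protein complex graph."""
    return len(labels) + len(edges)
```

The set model forgets that CDC42 occurs twice, the multiset model records the multiplicity, and only the graph model keeps track of which protein instances are in physical contact.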


A remark is in order to avoid confusion with protein interaction networks. Our definition 
of the graph model of protein complexes is formally identical to the definition of protein 
interaction network. However, there are important differences. The protein complex 
graph represents an assembly composed of multiple proteins that physically bind 
and co-exist in one temporal and physical space, while a protein interaction network 
represents proteins that may interact at some point of time, where individual interaction 
time points may be different. Hence, the protein complex graphs considered in this 
work typically consist of only a few vertices and edges, while interaction networks are 
much larger. 

For the set and multiset models, a similarity measure is readily given by the Jaccard 
similarity (see Section 2.6.2.1). For graphs, various techniques exist such as graph 
kernels [353] or graph matching [741]. A particularly intuitive approach is the graph edit 
distance, which has been proposed for pattern recognition tasks more than 30 years 
ago [574]. The graph edit distance between graphs $C$ and $C'$ measures the total cost of
the edit operations required to transform $C$ into $C'$. Defining similarity via graph edit
operations appears natural, but has computational disadvantages, as the graph edit
distance generalizes the classical maximum common subgraph problem [109], which
is NP-complete [220] and hard to approximate with given guarantees [311]. Therefore,
a large number of heuristics for computing the graph edit distance without provable
guarantees have been proposed. A particularly successful class of heuristics derives
the edit costs from the solution of a linear sum assignment problem [90, 352, 535].
Recently, several elaborate exact algorithms for computing the graph edit distance
have been proposed, but they are still limited to small graphs [132, 239, 388]. The binary
linear programming formulation of [388] allows researchers to compare graphs of 
moderate size using highly-optimized general purpose solvers. However, when we want 
to compare many complexes, evaluating the edit distance between all pairs becomes 
infeasible in practice. 

We therefore propose an efficient alternative. We define a family of similarity mea- 
sures on graphs using the Jaccard similarity, which can be efficiently computed and 
even more efficiently estimated with locality-sensitive hashing techniques. Taking the 
graph structure into account is achieved by the Weisfeiler-Leman labeling of the vertices 
[724], propagating vertex labels between neighbors. This approach is different from 
recent work that approximates and bounds the graph edit distance [534] and has the 
advantage of scaling better to large-scale studies. 

We find that the Weisfeiler-Leman (WL) similarity approximates edit similarity 
well, but can be computed much more efficiently. In addition, in large-scale database 
searches for complexes similar to a given query complex, we obtain an additional speed- 
up by an order of magnitude when filtering for high WL similarity using min-hashing. 
Finally, we discuss limitations and possible extensions of this work. 


2.6.2 Methods 


Our goal is to define a similarity measure between protein complexes that captures not 
only the (multisets of the) constituent proteins, but also the interaction topology (graph 
structure). We introduce a parameterized family of similarity measures on protein 
complexes, which are based on multiset comparisons of vertex labels in the graph 
representation and take the local neighborhood of each protein into account by using 
Weisfeiler-Leman labels. 


2.6.2.1 Jaccard Similarity of Sets and Multisets 
To compare sets or multisets, Jaccard similarity coefficients are an established quantity. 
Let $M \subseteq U$ and $M' \subseteq U$ be two subsets of a common universe $U$. Then the Jaccard
similarity between $M$ and $M'$ is defined as

$$J_{\text{set}}(M, M') := \frac{|M \cap M'|}{|M \cup M'|} \in [0, 1]. \qquad (2.20)$$


This definition is extended to multisets as follows. Recall that multisets $M$ and $M'$ are
functions $U \to \mathbb{N}_0$, assigning multiplicities $M(o)$ and $M'(o)$ to each object $o \in U$. (The
set definition can be seen as the special case where the value set is only $\{0, 1\}$ instead
of $\mathbb{N}_0$.) Then the Jaccard similarity between $M$ and $M'$ is defined as

$$J_{\text{multiset}}(M, M') := \frac{\sum_{o \in U} \min\{M(o), M'(o)\}}{\sum_{o \in U} \max\{M(o), M'(o)\}} \in [0, 1]. \qquad (2.21)$$


The definition of $J_{\text{multiset}}$ can be reduced to that of $J_{\text{set}}$ by augmenting the element names
with a running index in each multiset. For example,

$$J_{\text{multiset}}(\{\!\{A, A, B, C, C\}\!\}, \{\!\{A, C, C, C\}\!\}) = J_{\text{set}}(\{A_1, A_2, B_1, C_1, C_2\}, \{A_1, C_1, C_2, C_3\}).$$


Using this transformation, sketching techniques like min-hashing [105] that primarily 
work on sets can be extended to multisets. 
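
Both coefficients and the running-index reduction are easy to check on the example above. The following Python sketch is an illustration under stated assumptions: the function names are hypothetical, multisets are stored as `collections.Counter` objects, and returning 1.0 for two empty inputs is a convention chosen here, since Equations 2.20 and 2.21 leave that case undefined.

```python
from collections import Counter


def jaccard_set(a, b):
    """Equation 2.20 for Python sets."""
    union = a | b
    # Returning 1.0 for two empty sets is a convention assumed here.
    return len(a & b) / len(union) if union else 1.0


def jaccard_multiset(m, n):
    """Equation 2.21 for multisets stored as Counters."""
    numerator = sum((m & n).values())    # Counter '&' takes element-wise minima
    denominator = sum((m | n).values())  # Counter '|' takes element-wise maxima
    return numerator / denominator if denominator else 1.0


def with_running_index(m):
    """Turn a multiset into a set by numbering repeated elements 1, 2, ..."""
    return {(element, k) for element, count in m.items() for k in range(1, count + 1)}


m1 = Counter("AABCC")  # the multiset {{A, A, B, C, C}}
m2 = Counter("ACCC")   # the multiset {{A, C, C, C}}
```

For this pair, the minima sum to 3 (one A, two C) and the maxima sum to 6, so both `jaccard_multiset(m1, m2)` and `jaccard_set(with_running_index(m1), with_running_index(m2))` evaluate to 0.5.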


2.6.2.2 A Parametric Family of Protein Complex Similarity Measures 

Instead of comparing (protein complex) graphs directly by their labels and graph 
topology, we extract and compare multisets of derived features that represent local 
neighborhood information. Encoding the local structure surrounding a vertex is a gen- 
eral method widely used in graph matching and machine learning with graphs. Various 
concepts and techniques for this have been proposed, e.g., the k-hop neighborhood 
[319, 459], or the k-sphere neighborhood [37]. Weisfeiler and Leman developed an itera- 
tive label refinement procedure to derive a canonical graph representation for graph 
isomorphism testing [724]. The same procedure is often used to define graph similarities 
or graph kernels [602]. This approach recently became popular in machine learning for 
its expressivity and favorable algorithmic properties. Several highly efficient graph ker- 
nels based on Weisfeiler-Leman refinement have been proposed [351, 602], and several 
graph neural networks were shown to be at most as expressive as the Weisfeiler-Leman 
method [737]. 

The technique works as follows. Initially, the feature multiset of a graph consists 
of the union of all vertex labels, i.e., the protein names. After the initialization, the 
vertex labels are iteratively augmented by the labels of the neighboring vertices from 
the previous iteration, thereby encoding the (local) graph structure in the vertex labels. 
As previously mentioned, we label the vertices of graphs with the protein names; so 
two vertices that refer to the same protein type are labeled identically. Thus, the initial 
feature multiset is identical to the multiset representation as described above. Let us 
now formally define the process. 


Definition 1 (Weisfeiler-Leman labeling). Let $C = (V, E, \ell_0)$ be a graph with label
function $\ell_0 : V \to L_0 := P$, where $P$ is the initial set of vertex labels (e.g., protein names for
protein complex graphs). Furthermore, let $N(v) := \{u \mid \{v, u\} \in E\}$ denote the neighbors
of vertex $v \in V$. Then, the Weisfeiler-Leman labeling of iteration $i$ is defined as a
re-labeling of the graph. It replaces the current labeling function $\ell_{i-1} : V \to L_{i-1}$ with
a new labeling function $\ell_i : V \to L_i$. The value of $\ell_i$ for a vertex $v \in V$ is recursively
defined as

$$\ell_i(v) := \bigl(\ell_{i-1}(v), \{\!\{\ell_{i-1}(u) \mid u \in N(v)\}\!\}\bigr). \qquad (2.22)$$

Note that the second component of the new label is a multiset.

[Figure 2.27, panels A-C. A: Protein complex C. B: Multiset representation (multiset of node
labels): $WL_0(C) = \{\!\{$CDC42, CDC42, PAK4, PIK3CD, PIK3R1, PIK3R1, VAV1$\}\!\}$; compressed
representation: $\{\!\{0, 0, 1, 2, 3, 3, 4\}\!\}$. C: WL feature set of iteration 1, with
label(node) = (node label, compressed neighbor node labels):
$WL_1(C) = \{\!\{(0, \{0, 1, 3\}), (0, \{0, 3, 4\}), (1, \{0\}), (2, \{3\}), (3, \{0\}), (3, \{0, 2\}), (4, \{0\})\}\!\}$;
compressed representation: $\{\!\{5, 6, 7, 8, 9, 10, 11\}\!\}$.]

Fig. 2.27: Example of a protein complex and its representations. The colors highlight the labels of
an example node in $WL_0(C)$ and $WL_1(C)$. A: Graph representation of protein complex C. B: Multiset
representation of C which is equal to $WL_0(C)$. C: Result of the first WL iteration.


To avoid increasingly complex labels consisting of nested multisets, label compres- 
sion is performed after each step. This is achieved by applying an injective function, 
which maps a pair consisting of a label and a multiset of labels of the form given in 
Equation 2.22 to an integer label. The label compression step must be consistent across 
multiple graphs in order to construct comparable feature sets. If all graphs in the dataset 
are known from the beginning, we can sort all the multisets of one iteration, identify 
the identical pairs, and assign them to the same new integer label. This step can be 
realized in time linear in the total number of edges of all graphs by applying variants of 
bucket sort [602]. A less efficient, but more flexible approach, which is suitable even 
when the graphs are only revealed successively, is to manage the injective map used for 
label compression in a hash table. 

Given the Weisfeiler-Leman labeling function for any iteration i, we can now define 
the multiset of Weisfeiler-Leman features for iteration i. 


Definition 2 (Weisfeiler-Leman feature set). Let $C = (V, E, \ell_0)$ be a graph with label
function $\ell_0 : V \to L_0 = P$, where $P$ is the initial set of vertex labels. Then, the Weisfeiler-Leman
features of iteration $i$ are defined as the multiset $WL_i(C) = \{\!\{\ell_i(v) \mid v \in V\}\!\}$.


Note that $WL_0(C)$ always corresponds to the initial multiset of labels (protein names).
Accordingly, $WL_1(C)$ integrates the neighborhood labels of each node. Figure 2.27
shows an example protein complex graph, together with the associated feature sets
$WL_0(C)$ and $WL_1(C)$. A node and its neighborhood are highlighted in red and blue to
demonstrate the relation between $WL_0(C)$ and $WL_1(C)$.
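
The refinement and the hash-table variant of label compression can be sketched as follows. This is an illustrative Python sketch rather than the authors' Java implementation; the function name `wl_features` and the key layout of the table are assumptions. The protein names are those of the complex in Figure 2.27, whereas the edge list is an assumed topology chosen for this example.

```python
from collections import Counter


def wl_features(labels, edges, iterations, table):
    """Return [WL_0(C), ..., WL_iterations(C)] as Counters of integer labels.

    `table` is the shared hash table used for label compression; passing the
    same dict for every graph keeps the integer labels comparable.
    """
    neighbors = {v: [] for v in labels}
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)

    # Iteration 0: compress the protein names themselves.
    current = {v: table.setdefault(("init", name), len(table))
               for v, name in labels.items()}
    features = [Counter(current.values())]
    for _ in range(iterations):
        refined = {}
        for v in labels:
            # A sorted tuple is a canonical encoding of the neighbor multiset.
            key = (current[v], tuple(sorted(current[u] for u in neighbors[v])))
            refined[v] = table.setdefault(key, len(table))
        current = refined
        features.append(Counter(current.values()))
    return features


labels = {0: "CDC42", 1: "CDC42", 2: "PAK4", 3: "PIK3CD",
          4: "PIK3R1", 5: "PIK3R1", 6: "VAV1"}
edges = [(0, 1), (0, 2), (0, 4), (1, 5), (1, 6), (3, 5)]  # assumed topology
table = {}
wl0, wl1 = wl_features(labels, edges, 1, table)
```

Because new integer labels are handed out in order of first appearance, the five protein names receive the labels 0 to 4 and the seven distinct (label, neighbor multiset) pairs of the first iteration receive 5 to 11. Reusing `table` for further graphs guarantees that identical pairs map to identical integers across the whole dataset.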

We use the Jaccard coefficient to obtain a normalized similarity based on multiset
intersection. We apply the Jaccard coefficient to the feature sets of each iteration individually
and compute a convex combination of the results. Let $w = (w_i)_{i \ge 0}$ be a sequence of
non-negative weights with $\sum_{i \ge 0} w_i = 1$. We quantify the weighted similarity between
two labeled graphs $C$ and $C'$ by

$$S_w(C, C') := \sum_{i \ge 0} w_i \cdot J_{\text{multiset}}(WL_i(C), WL_i(C')), \qquad (2.23)$$

where $J_{\text{multiset}}$ is given by Equation 2.21. This defines a family of similarity measures
between labeled graphs with values in $[0, 1]$, parameterized by the weight vector
$w = (w_0, w_1, \dots)$.

It is easy to see that, as long as $w_0 > 0$, we have $S_w(C, C') = 0$ if and only if the label
sets of $C$ and $C'$ are disjoint. If $S_w(C, C') < 1$, the graphs are not isomorphic. However,
$S_w(C, C') = 1$ does not necessarily imply that $C$ and $C'$ are isomorphic even if $w_i > 0$ for
all $i$. There exist examples of non-isomorphic graphs $G$, $G'$ with $WL_i(G) = WL_i(G')$ for all
$i \ge 0$. (As a simple example, take $G$ to be a cycle of six vertices, and $G'$ to be two cycles of
three vertices, all with the same label.) On the other hand, there exist classes of graphs,
such as the CR-graphs, for which the implication "$S_w(C, C') = 1 \Rightarrow C, C'$ are isomorphic"
is true if $w_i > 0$ for all $i$ [31]. Moreover, the implication holds with high probability for
random graphs (without vertex labels) even when $w_i = 0$ for all $i > 3$ [34].
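
The cycle example can be verified directly. The Python sketch below is a hypothetical illustration: it keeps the labels as nested tuples instead of compressing them, which is only sensible for tiny graphs, and the function name `wl_multisets` and the label "P" are assumptions.

```python
from collections import Counter


def wl_multisets(adjacency, label, iterations):
    """WL feature multisets of a graph in which every vertex carries `label`.

    Labels are kept as nested tuples instead of being compressed, which is
    fine for tiny graphs and avoids a shared compression table.
    """
    current = {v: label for v in adjacency}
    features = [Counter(current.values())]
    for _ in range(iterations):
        current = {v: (current[v], tuple(sorted(current[u] for u in adjacency[v])))
                   for v in adjacency}
        features.append(Counter(current.values()))
    return features


# One cycle of six vertices versus two disjoint cycles of three vertices.
six_cycle = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
two_triangles = {0: [1, 2], 1: [0, 2], 2: [0, 1],
                 3: [4, 5], 4: [3, 5], 5: [3, 4]}

f_cycle = wl_multisets(six_cycle, "P", 3)
f_triangles = wl_multisets(two_triangles, "P", 3)
```

In both graphs every vertex has the same label and exactly two neighbors, so each iteration assigns one and the same refined label to all six vertices; `f_cycle` and `f_triangles` therefore coincide in every iteration although the graphs are not isomorphic.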

In our application scenario, we may assume that most protein complexes are non- 
adversarial graphs with sufficiently simple structure and expressive initial labels such 
that their Weisfeiler-Leman features are appropriate to characterize their similarity. In 
fact, we put forward the hypothesis that using a single iteration is frequently sufficient
for practical purposes, and we set $w_i := 0$ for $i \ge 2$ in our computational experiments
(see Results) and only have a single free parameter $w_1 \in [0, 1]$ that defines $w_0 := 1 - w_1$.
In this case, $S_{w_1}$ is efficiently computable: a proof of the following lemma can be found
in the work of [602].


Lemma 3. For $w_1 \in [0, 1]$, each of the one-parameter similarity measures

$$S_{w_1}(C, C') = (1 - w_1) \cdot J_{\text{multiset}}(WL_0(C), WL_0(C')) + w_1 \cdot J_{\text{multiset}}(WL_1(C), WL_1(C'))$$

can be computed in $O(|C| + |C'|)$ time, where $|C| = |V| + |E|$.
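
A compact way to evaluate the one-parameter measure is sketched below in Python. It is an illustration under assumptions, not the authors' implementation: labels are kept as uncompressed tuples and neighbor labels are sorted, so the sketch does not attain the linear-time bound of Lemma 3. The toy pair is a chain of three hypothetical proteins P1, P2, P3 and the same chain with PTPN3 attached to one end; the names P1 to P3 and the attachment point are assumptions.

```python
from collections import Counter


def jaccard_multiset(m, n):
    """Equation 2.21 for Counters (1.0 for two empty multisets by convention)."""
    union = sum((m | n).values())
    return sum((m & n).values()) / union if union else 1.0


def wl0_wl1(labels, edges):
    """WL_0 and WL_1 feature multisets; labels stay uncompressed tuples."""
    neighbors = {v: [] for v in labels}
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)
    wl0 = Counter(labels.values())
    wl1 = Counter((labels[v], tuple(sorted(labels[u] for u in neighbors[v])))
                  for v in labels)
    return wl0, wl1


def s_w1(c1, c2, w1):
    """One-parameter WL similarity for complexes given as (labels, edges)."""
    a0, a1 = wl0_wl1(*c1)
    b0, b1 = wl0_wl1(*c2)
    return (1 - w1) * jaccard_multiset(a0, b0) + w1 * jaccard_multiset(a1, b1)


chain = ({0: "P1", 1: "P2", 2: "P3"}, [(0, 1), (1, 2)])
extended = ({0: "P1", 1: "P2", 2: "P3", 3: "PTPN3"}, [(0, 1), (1, 2), (2, 3)])
```

For this pair, the multiset Jaccard similarity is 3/4 on the protein names and 2/5 on the first-iteration labels, so `s_w1(chain, extended, w1)` moves linearly from 0.75 at $w_1 = 0$ to 0.4 at $w_1 = 1$.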


2.6.2.3 A Similarity Measure Based on Graph Edit Distance 

To compare the family of Weisfeiler-Leman multiset-based similarity measures defined 
above with graph edit distance, we state a formal definition of the edit-based similarity. 
We allow the following elementary operations to edit a graph: vertex deletion, vertex 
insertion, vertex relabeling, edge deletion, and edge insertion. A sequence $(o_1, \dots, o_k)$
of such edit operations that transforms a graph $G$ into another graph $H$ is called an edit
path from $G$ to $H$. Each operation $o$ is assigned a cost $c(o)$, which is zero for substituting
vertices and edges with the same label. We use a cost of 1 for all operations except
vertex relabeling, which has a cost of 2, corresponding to one deletion and one insertion
(leaving the edges in place). Note that, in contrast, deleting or inserting a vertex of degree $k$
has cost $k + 1$: $k$ for its incident edges plus 1 for the vertex itself. We denote the set of all
possible edit paths from $G$ to $H$ by $\Upsilon(G, H)$.


Definition 4. Let $G$ and $H$ be labeled graphs. The graph edit distance from $G$ to $H$ is
defined by

$$d(G, H) := \min\Bigl\{ \sum_{i=1}^{k} c(o_i) \Bigm| (o_1, \dots, o_k) \in \Upsilon(G, H) \Bigr\}. \qquad (2.24)$$
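
The cost model can be written down as a small lookup table. The Python sketch below only evaluates the cost of a given edit path; finding the minimum over all edit paths in Equation 2.24 is the computationally hard part and is not attempted here. The operation names and helper functions are assumptions chosen for this illustration.

```python
# Cost model stated in the text: every operation costs 1, except vertex
# relabeling, which costs 2. The operation names are chosen for this sketch.
COST = {
    "vertex_deletion": 1,
    "vertex_insertion": 1,
    "vertex_relabeling": 2,
    "edge_deletion": 1,
    "edge_insertion": 1,
}


def path_cost(edit_path):
    """Total cost of an edit path given as a sequence of operation names."""
    return sum(COST[operation] for operation in edit_path)


def delete_vertex_with_edges(degree):
    """Edit path that removes a vertex of the given degree and its incident edges."""
    return ["edge_deletion"] * degree + ["vertex_deletion"]
```

For instance, removing a vertex of degree 3 costs 4, while relabeling a vertex costs the same as deleting it and inserting a new one without touching any edges.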


Intuitively, the graph edit distance preserves a subgraph $G'$ of $G$ that is also contained
in $H$ using zero-cost substitutions, deletes the vertices and edges in $G$ that are not in
$G'$, and then inserts vertices and edges to obtain an isomorphic copy of $H$. Therefore all
non-zero costs can be attributed to the elements that are in one of the graphs, but not in
their common subgraph. In this sense the graph edit distance is similar to the symmetric
difference of two sets. This observation motivates the following normalized similarity
measure derived from the graph edit distance. We define the graph edit similarity as

$$S_g(G, H) := \frac{|G| + |H| - d(G, H)}{|G| + |H| + d(G, H)} \in [0, 1], \qquad (2.25)$$

where $|G| := |V(G)| + |E(G)|$. Note that the graph edit distance between $G$ and $H$ is at
most $|G| + |H|$, which is achieved by deleting all vertices and edges of $G$ and inserting
all vertices and edges of $H$. In this case the graph edit similarity is zero. Similarly,
$S_g(G, H) = 1$ if and only if $d(G, H) = 0$. In this respect the similarity measure resembles
the Jaccard similarity. In fact, the following lemma shows that, if the edges are not 
taken into account, the graph edit similarity equals the multiset Jaccard similarity. 
Therefore, the graph edit similarity can indeed be seen as a natural extension of the 
multiset Jaccard similarity to graph structured data. 


Lemma 5. For two vertex-labeled graphs $G$, $H$, let $C$, $D$ denote their respective label
multisets. For the edge-free graphs $G' = (V(G), \emptyset)$ and $H' = (V(H), \emptyset)$ it holds that
$S_g(G', H') = J_{\text{multiset}}(C, D)$.


Proof. An optimal graph edit path is obtained as follows: We substitute the vertices
with common labels free of cost, which are $Z = \sum_{p \in P} \min\{C(p), D(p)\}$ in total. We
delete the remaining $|G'| - Z$ vertices in $G'$ and insert $|H'| - Z$ vertices to obtain an
isomorphic copy of $H'$ at a total cost of $|G'| + |H'| - 2Z = d(G', H')$. Instead we may also
relabel up to $\min\{|G'|, |H'|\} - Z$ of the remaining vertices, each at cost two, which results
in the same total cost. Using the fact that $|G'| = \sum_{p \in P} C(p)$ and $|H'| = \sum_{p \in P} D(p)$, we
obtain the result by calculating

$$
\begin{aligned}
S_g(G', H') &= \frac{|G'| + |H'| - d(G', H')}{|G'| + |H'| + d(G', H')} = \frac{Z}{|G'| + |H'| - Z} \\
&= \frac{Z}{\sum_{p \in P} C(p) + \sum_{p \in P} D(p) - Z} \\
&= \frac{\sum_{p \in P} \min\{C(p), D(p)\}}{\sum_{p \in P} \bigl(C(p) + D(p) - \min\{C(p), D(p)\}\bigr)} \\
&= \frac{\sum_{p \in P} \min\{C(p), D(p)\}}{\sum_{p \in P} \max\{C(p), D(p)\}} = J_{\text{multiset}}(C, D). \qquad \square
\end{aligned}
$$
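
The statement of Lemma 5 can be checked numerically on the multisets used in Section 2.6.2.1. The Python sketch below is an illustration with hypothetical function names; it assumes non-empty label multisets and uses exact fractions to avoid rounding.

```python
from collections import Counter
from fractions import Fraction


def edit_similarity(size_g, size_h, distance):
    """Graph edit similarity of Equation 2.25 as an exact fraction."""
    return Fraction(size_g + size_h - distance, size_g + size_h + distance)


def edge_free_edit_distance(c, d):
    """d(G', H') for edge-free graphs with label multisets c and d (Lemma 5)."""
    z = sum((c & d).values())  # vertices substituted free of cost
    return sum(c.values()) + sum(d.values()) - 2 * z


def jaccard_multiset(c, d):
    """Equation 2.21 as an exact fraction (non-empty multisets assumed)."""
    return Fraction(sum((c & d).values()), sum((c | d).values()))


c = Counter("AABCC")  # labels of G': {{A, A, B, C, C}}
d = Counter("ACCC")   # labels of H': {{A, C, C, C}}
distance = edge_free_edit_distance(c, d)  # 5 + 4 - 2 * 3 = 3
similarity = edit_similarity(sum(c.values()), sum(d.values()), distance)
```

Here $Z = 3$, so the edit distance is 3 and the edit similarity is $(9 - 3)/(9 + 3) = 1/2$, which equals the multiset Jaccard similarity of the two label multisets.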


2.6.2.4 Database Searches 
An important application of a similarity measure $S(\cdot, \cdot)$ is for similarity searches in large
databases. When searching for protein complexes similar to a given input complex
(“query” $Q$) in a large database, one can perform a linear scan and compute $S(Q, C)$ for
each complex $C$ in the database and report those with $S(Q, C) > T$ for a given threshold
$T$. However, computing $S(Q, C)$ exactly may be computationally expensive, and in many
cases, a quickly computable upper or lower bound can be used as an initial filter.

In Section 2.6.3, we evaluate the proposed similarity measure $S_{w_1}$ against graph
edit similarity $S_g$, and for database searches we use individual filtering techniques as
described in this section.


Weisfeiler-Leman Similarity For $S_w$, a speed-up is possible using min-hashing [105],
which is a locality-sensitive hashing scheme for the Jaccard similarity between sets that
can be extended to multisets, as described in Section 2.6.2.1. We use a simple approach
that maps the $WL_i(C)$ multisets to integer hash values $h_{i,j}(C)$ using a large number
($j = 1, \dots, K$) of different random hash functions. The exact $S_w(Q, C)$ value is only
computed if any of the hash values agrees with that of the query, i.e., if $h_{i,j}(C) = h_{i,j}(Q)$
for any $i = 0, 1$ and $j = 1, \dots, K$. The number $K$ of hash functions is chosen such that
the false negative error rate is lower than a given probability threshold (0.01).
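
One way to realize such a filter is sketched below in Python. The sketch rests on several assumptions that are not taken from the original work: the hash functions are modeled by salted BLAKE2b digests, each $h_{i,j}$ is the minimum hash value over the index-augmented elements of $WL_i$, and $K$ is chosen from the idealized min-hash property that two sets with Jaccard similarity $J$ agree on one hash value with probability $J$. Since $S_w \ge T$ implies that at least one $J_{\text{multiset}}(WL_i(Q), WL_i(C))$ is at least $T$, the miss probability is then at most $(1 - T)^K$.

```python
import hashlib
import math
from collections import Counter


def num_hash_functions(threshold, max_false_negative_rate=0.01):
    """Smallest K with (1 - threshold)**K <= max_false_negative_rate (0 < threshold < 1)."""
    return math.ceil(math.log(max_false_negative_rate) / math.log(1.0 - threshold))


def salted_hash(element, j):
    """j-th hash function, realized here by BLAKE2b with j as salt (an assumption)."""
    digest = hashlib.blake2b(repr(element).encode(), digest_size=8,
                             salt=j.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(multiset, k):
    """Min-hash values of a non-empty multiset (Counter), using running indices."""
    indexed = [(x, r) for x, count in multiset.items() for r in range(count)]
    return [min(salted_hash(e, j) for e in indexed) for j in range(k)]


def is_candidate(query_signatures, complex_signatures):
    """True if any hash value agrees for any WL iteration i and hash function j."""
    return any(hq == hc
               for sig_q, sig_c in zip(query_signatures, complex_signatures)
               for hq, hc in zip(sig_q, sig_c))


k = num_hash_functions(0.5)  # similarity threshold T = 0.5
query = [minhash_signature(Counter("AABCC"), k)]  # one signature per WL iteration
```

With $T = 0.5$ the bound $(1 - T)^K \le 0.01$ gives $K = 7$. Signatures are computed once per complex and stored, so that a query only triggers the exact similarity computation for complexes that pass `is_candidate`.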


Graph Edit Similarity The graph edit distance is widely used for searching graph 
databases and several approaches tailored to this task have been proposed [132, 323, 
397, 712, 755, 768, 769, 770]. These methods typically follow a filter-verification approach 
[755]. In the filter phase, efficiently computable lower bounds on the graph edit distance 
are used to eliminate dissimilar graphs. The remaining graphs are then verified using 
upper bounds on the graph edit distance and, if necessary, by an exact algorithm to 
obtain the final result. The methods can be categorized according to different criteria.
Several methods compute a lower bound for every graph in the database [132, 323, 755],
while the others use an index data structure to find candidates without scanning the 
whole database. The lower bounds are often derived from overlapping substructures, 
while some methods partition the graphs into disjoint parts [323, 397, 769]. Recently, the 
large number of proposed lower and upper bounds were systematically studied in an 
extensive experimental evaluation according to their running time and tightness [90]. 

Most of the above-mentioned algorithms assume a uniform edit cost function and
cannot directly be applied to solve database search problems with respect to the graph 
edit similarity as defined in Equation 2.25: To make graph similarity compatible with 
the Jaccard index, we use a cost of 2 instead of 1 for vertex relabeling. Nevertheless, 
several of the efficiently computable bounds for uniform graph edit distance can be 
generalized for this case, and we have implemented such generalizations (see below). 
Then, by substituting d(G, H) in Equation 2.25 by known lower (upper) bounds on the 
graph edit distance, we obtain upper (lower) bounds for the graph edit similarity. 

Since the vertex labels denoting proteins are highly specific when comparing pro- 
tein complexes, we derive a first lower bound on the graph edit distance from the 
cardinality of the symmetric difference of the two vertex label multisets [323, 768]. This 
is equivalent to using $J_{\text{multiset}}$ as an upper bound for the graph edit similarity.
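
The substitution of a lower bound on the edit distance into Equation 2.25 can be illustrated in a few lines of Python. The function names and the toy pair (a chain of three hypothetical proteins P1, P2, P3 with two edges, and the same chain extended by PTPN3 with three edges) are assumptions made for this sketch.

```python
from collections import Counter
from fractions import Fraction


def label_lower_bound(labels_g, labels_h):
    """Cardinality of the symmetric difference of two label multisets."""
    return sum(((labels_g - labels_h) + (labels_h - labels_g)).values())


def similarity_upper_bound(size_g, size_h, distance_lower_bound):
    """Equation 2.25 with d(G, H) replaced by a lower bound on it."""
    total = size_g + size_h
    return Fraction(total - distance_lower_bound, total + distance_lower_bound)


g_labels = Counter(["P1", "P2", "P3"])           # chain with 2 edges, |G| = 5
h_labels = Counter(["P1", "P2", "P3", "PTPN3"])  # chain with 3 edges, |H| = 7
bound = similarity_upper_bound(5, 7, label_lower_bound(g_labels, h_labels))
```

The label multisets differ in one element, so the edit distance is at least 1 and the edit similarity is at most 11/13. The exact values for this pair are $d = 2$ (one vertex insertion and one edge insertion) and $S_g = 10/14$, which respects the bound. In a filter-verification search, a complex whose upper bound does not exceed the threshold $T$ can be discarded without computing the exact edit distance.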

As a second lower bound we use a more expensive approach based on the linear sum 
assignment problem, which was shown to provide a good trade-off between tightness 
and running time [90]. The assignment instance is defined on the vertices of the two 
graphs, where additional dummy vertices are introduced that represent vertex insertion 
and deletion. The cost of assigning an individual vertex is made up of the cost of
substituting the vertex label and the cost of an optimal assignment between the incident
edges [535]. Since we do not consider edge labels, the assignment cost matrix can be
computed in quadratic time (cf. heuristic BRANCH-CONST in [90]) and the instance can be 
solved in cubic time. The cost of the assignment instance serves as a lower bound on the 
graph edit distance. Following [535], we obtain an upper bound from the cost of the edit 
path derived from the assignment. If this is not sufficient for verification, we compute 
the exact graph edit distance using the binary linear programming formulation of [388]. 


2.6.3 Results 


2.6.3.1 Hypothesis 

We hypothesize that finite truncations of the Weisfeiler-Leman-based family of similarity
measures $S_w$ (defined in Equation 2.23), in particular $S_{w_1}$ (Lemma 3), are in good
agreement with the edit similarity (defined in Equation 2.25) for typical protein complex
graphs. Notably, $S_{w_1}$ has the advantage that it can be efficiently computed.




2.6.3.2 Experimental Setup 

We have implemented the similarity measures based on Weisfeiler-Leman labeling 
and the graph edit similarity in Java 8. To compute the exact graph edit distance, we 
used a recent binary linear programming formulation [388] and solved the instances 
using Gurobi 7.5.2. All experiments were run on 18-core Intel Xeon E5-2699 CPUs with 
2.3 GHz and 512 GB RAM, using 64-bit Ubuntu Linux 18.04. Our analysis is available as
a Snakemake workflow [345] on GitHub
(https://github.com/BiancaStoecker/complex-similarity-evaluation).


2.6.3.3 Data Generation 

As mentioned in the introduction (Section 2.6.1), obtaining graphs from real protein 
complexes is difficult at the moment, because experimental techniques that resolve 
the (graph) topology of the complexes are still in the developmental stage. Therefore 
we resort to the simulation of complexes, based on two types of knowledge: possible 
physical protein-protein interactions, formalized by a protein interaction network, and 
constraints between protein interactions. Especially the second type of information 
allows us to simulate more realistic complexes than what we would get from interaction 
networks alone. 

Formally, a protein interaction network is an undirected graph $N = (P, I)$, where $P$
is the set of protein types of a cell (or an organism), and $I \subseteq P \times P$ indicates the pairs of
protein types that may potentially physically interact. Since $N$ describes the entirety of
possible interactions, any protein complex can be seen as a connected subgraph of $N$.

Protein interactions are not independent of each other. Their
dependencies are generated by two major mechanisms: allosteric regulation,
in which the capability of a protein to bind other proteins is affected by a conformational
change upon one interaction [374], and steric hindrance, which prevents proteins from
binding to identical or nearly identical protein domains, leading to mutual exclusiveness
of interactions [573]. The dependencies between interactions constrain the set
of possible protein complexes and their assembly paths. One possible model for this
is the constrained protein interaction network, in which the protein interaction network is
enhanced by the interaction dependencies (constraints) modeled as propositional logic
formulas [649]. With constrained protein interaction networks, we can stochastically
simulate complex formation based on the available knowledge and obtain a detailed 
interaction topology (which proteins physically interact) for each complex. 

For the simulation, a constrained protein interaction network was generated from 
the human adhesome protein network and a set of interaction dependencies obtained 
from protein domain interaction databases and manual curation (for details, see [649]). 
Then, protein complex assembly was simulated in a step-wise process, with association 
and dissociation rates calibrated to fit the complex size distribution of the CORUM 
database [551], until reaching convergence. The obtained complexes mimic the size
distribution of known complexes, while also providing information about the actual
physical interactions happening inside the complex, information that is currently
not yet available for real data. Over 2.2 million protein complex graphs were simulated
in this way. Some statistics are given in Table 2.5: Most simulated graphs are small
and tree-like ($|V| = |E| + 1$) and consist of low-degree nodes. We were able to verify
that all distinct simulated graphs can be distinguished by WL labels after at most two
iterations.

Tab. 2.5: Properties of the 2 242 972 simulated protein complex graphs. There are 717 distinct vertex
labels (protein names), the most frequent ones being 'GRB2' occurring 100 009 times, 'EGFR'
occurring 84 706 times, and 'CRK' occurring 83 117 times. The least frequent labels are 'PHF20' (55x),
'CDYL' (63x), and 'PHF20L1' (75x). The average label frequency is 14 561.6. The right table shows how
the number of distinct labels increases with WL iterations; here $|WL_{\le i}|$ refers to the cardinality of the
set of all labels up to the i-th WL iteration.

Quantity    Average    Min      Max          Label set sizes
|V|         4.655      3        126          $|WL_{\le 0}|$    717
|E|         3.655      2        125          $|WL_{\le 1}|$    461 657
Degree      1.570      1        28           $|WL_{\le 2}|$    4 114 569
Diameter    2.709      2        28           $|WL_{\le 3}|$    9 237 071
Density     0.530      0.016    0.667        $|WL_{\le 4}|$    14 703 231

Fig. 2.28: Three exemplary pairs of protein complexes. Each labeled node is a protein instance,
each edge a protein interaction, and solid black vs. dashed red edges distinguish between the
two complexes. A: Edit similarity 0.714; WL similarity in [0.4, 0.75] depending on weight $w_1$. B:
Edit similarity 0.838; WL similarity 1.0 (independent of $w_1$). C: Edit similarity 0.9; WL similarity in
[0.667, 0.818] depending on $w_1$.

To evaluate the Weisfeiler-Leman based similarity (“WL similarity”) against the edit 
distance based similarity (“edit similarity”), we computed both measures on selected 
pairs of simulated complexes. 


2.6.3.4 Illustrating Examples

We first consider three exemplary pairs (Figure 2.28 A-C) with edit similarities of approx- 
imately 0.7, 0.8 and 0.9, respectively, the latter being the most similar observed pair. 
Our simulation has been calibrated to yield complexes of a realistic size distribution 
that additionally reflect all currently known interaction dependencies. However, since 
this data is likely incomplete and we also did not consider the law of mass action, we 

Fig. 2.29: Comparison of edit similarity and WL similarity. A and D: Scatterplot between edit
similarity and WL similarity for weight w1 = 0.59 with maximum Pearson correlation (A) and w1 = 0.31
with maximum cosine similarity (D), including marginal distributions and least-squares regression
line. Each point represents a pair of complexes. B: Pearson correlation coefficient between edit and
WL similarity as a function of w1. The maximum 0.946 occurs for w1 = 0.59 (scatterplot A). E: Cosine
similarity as a function of w1. The maximum 0.983 occurs for w1 = 0.31 (scatterplot D). C and F:
Heatmap showing the Pearson correlation (C) and cosine similarity (F) over weights w1 and w2 when
using 2 WL iterations and w0 = 1 - (w1 + w2) ≥ 0.


do not claim that the particular combination of proteins in these examples is likely to 
occur in reality. The examples are therefore only meant to illustrate the behavior of 
the two measures and give an intuition of cases where WL similarity fails to properly 
approximate edit similarity. 

In example A, an additional protein (PTPN3) is added to an existing complex, a
linear chain of 3 proteins. The edit similarity is 10/14 = 0.714; the WL similarity lies
between 0.75 for w1 = 0 and 0.4 for w1 = 1. Because the edit similarity lies between the
extreme WL similarities, there exists a unique weight w1* ≈ 0.102 for which WL and
edit similarities agree for this particular complex pair. Example B is a noteworthy case:
the WL similarity is 1.0, independently of w1, because the vertex labels are
identical even after the first Weisfeiler-Leman iteration. (Further iterations would show
a difference.) The edit similarity is 20/24 = 0.833, which is obtained by attaching ALB
to the other LRP2 protein. In example C, one protein is replaced by another one in a
fairly large complex. The edit similarity (0.905) is relatively high and outside the WL
similarity range between 0.667 for w1 = 1 and 0.818 for w1 = 0.
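
The numbers for example A can be retraced with a short sketch. It assumes, in line with
the discussion in Section 2.6.4, that the WL similarity is the convex combination
(1 - w1) * J0 + w1 * J1 of the multiset Jaccard coefficients of the original labels (J0)
and of the labels after one WL iteration (J1). The chain labels P1, P2, P3, the attachment
of PTPN3 to the end of the chain, and all function names are illustrative assumptions and
are not taken from the simulated data.

```python
from collections import Counter

def multiset_jaccard(a: Counter, b: Counter) -> float:
    """Jaccard coefficient of two multisets: sum of min counts over sum of max counts."""
    union = sum((a | b).values())
    return sum((a & b).values()) / union if union else 1.0

def wl_iteration(labels: dict, edges: list) -> Counter:
    """One WL refinement step: pair each vertex label with its sorted neighbor labels."""
    neighbors = {v: [] for v in labels}
    for u, v in edges:
        neighbors[u].append(labels[v])
        neighbors[v].append(labels[u])
    return Counter((labels[v], tuple(sorted(neighbors[v]))) for v in labels)

# Hypothetical stand-in for example A: a chain of three proteins, then PTPN3 attached.
labels_a = {0: "P1", 1: "P2", 2: "P3"}
edges_a = [(0, 1), (1, 2)]
labels_b = {**labels_a, 3: "PTPN3"}
edges_b = edges_a + [(2, 3)]

j0 = multiset_jaccard(Counter(labels_a.values()), Counter(labels_b.values()))
j1 = multiset_jaccard(wl_iteration(labels_a, edges_a), wl_iteration(labels_b, edges_b))

def wl_similarity(w1: float) -> float:
    """Convex combination of both Jaccard coefficients with w0 = 1 - w1."""
    return (1.0 - w1) * j0 + w1 * j1

edit_similarity = 10 / 14                      # value reported in the text
w1_star = (j0 - edit_similarity) / (j0 - j1)   # solves wl_similarity(w1) == edit_similarity

print(j0, j1, round(w1_star, 3))
```

Under these assumptions the sketch reproduces J0 = 0.75, J1 = 0.4, and w1* ≈ 0.102. For
example C no such weight exists, because 0.905 lies outside the range [0.667, 0.818].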




2.6.3.5 Large-Scale Comparison 

For the following comparison, we considered only a stratified subset of all possible 
pairs for the analysis, because calculating the edit similarity is computationally costly. 
To obtain candidate pairs, we considered all pairs of complexes that have at most 20 
proteins (larger complexes are so rare that high similarities are unlikely), that have a 
size difference of protein multisets of at most 10, and that share at least one protein. 
These were sorted in descending order according to the number of shared proteins. Then 
the edit similarity was computed on the first 500 000 candidate pairs, and the similarity 
values were grouped into bins of width 0.1. Because most pairs of complexes share a 
small number of proteins, we find many pairs with small edit similarity (but none in the 
range [0.0, 0.1[ because we required one common protein) and fewer pairs with edit 
similarity above 0.5. To achieve a uniform distribution among bins for the comparison, 
we randomly selected 1000 pairs from each bin, excluding the bin [0.9, 1.0[ which 
contained a single pair. This yielded 8000 pairs of complexes from 8 bins. 
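
The binning and sampling step can be sketched as follows. This is a minimal illustration
under stated assumptions, not the procedure used for the study: the function name
stratified_sample, the fixed seed, the handling of a similarity of exactly 1.0, and the
toy data are all hypothetical.

```python
import random
from collections import defaultdict

def stratified_sample(scored_pairs, per_bin=1000, n_bins=10, seed=0):
    """Group (pair, similarity) items into equal-width bins and draw per_bin items from each.

    Bins holding fewer than per_bin items are skipped, mirroring the exclusion
    of the sparsely populated top bin in the text.
    """
    rng = random.Random(seed)
    bins = defaultdict(list)
    for pair, sim in scored_pairs:
        # Small epsilon guards against floating-point error at bin edges;
        # a similarity of exactly 1.0 is assigned to the last bin (assumption).
        index = min(int(sim * n_bins + 1e-9), n_bins - 1)
        bins[index].append((pair, sim))
    sample = []
    for index in sorted(bins):
        if len(bins[index]) >= per_bin:
            sample.extend(rng.sample(bins[index], per_bin))
    return sample

# Toy usage with synthetic similarities (not data from the study).
toy = [((i, k), k / 10 + 0.05) for k in range(1, 9) for i in range(5)]
toy.append(((0, 9), 0.95))                  # a single pair in the top bin
picked = stratified_sample(toy, per_bin=3)
print(len(picked))                          # 8 bins with 3 pairs each
```

With per_bin = 1000 and the eight populated bins described above, the same scheme yields
the 8000 pairs used in the comparison.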

Because most protein complexes are small and do not exhibit properties of examples
B or C of Figure 2.28, the overall agreement between WL similarity and edit
similarity is high. For each of the selected complex pairs, we computed the exact edit
similarity and the WL similarity for each weight w1 ∈ W := {0.0, 0.01, ..., 1.0} and
w0 := 1 - w1. Let e be the vector of edit similarity values and s(w1) the corresponding
vector of WL similarity values using weight w1 for the first WL iteration. To compare
the similarity measures, we calculated both the Pearson correlation coefficient and
the cosine similarity of e and s(w1) for all w1 ∈ W. As can be seen from Figure 2.29B, the
highest Pearson correlation values occur for w1 between 0.56 and 0.62. The maximum
Pearson correlation coefficient of 0.946 is obtained for w1 = 0.59. Figure 2.29A shows
the scatter plot between the similarities for this weight. For the cosine similarity, the
maximum value is reached for weight w1 = 0.31, but the function is less peaked, and
all values of w1 < 0.6 lead to high agreement (Figure 2.29E).
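
The weight scan can be illustrated with a small sketch. It assumes that s(w1) is formed
per pair as (1 - w1) * J0 + w1 * J1 from the two Jaccard coefficients; the function names
and the four toy values per vector are hypothetical and serve only to show the mechanics.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def cosine(x, y):
    """Cosine similarity of two equally long sequences."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

def scan_weights(e, j0, j1, steps=100):
    """Evaluate both agreement measures for w1 = 0, 1/steps, ..., 1."""
    results = []
    for k in range(steps + 1):
        w1 = k / steps
        s = [(1 - w1) * a + w1 * b for a, b in zip(j0, j1)]
        results.append((w1, pearson(e, s), cosine(e, s)))
    return results

# Toy data: the "edit similarity" is constructed as the equal-weight mixture,
# so the scan should recover w1 = 0.5.
j0 = [0.2, 0.5, 0.9, 0.4]
j1 = [0.1, 0.6, 0.3, 0.8]
e = [0.5 * a + 0.5 * b for a, b in zip(j0, j1)]

results = scan_weights(e, j0, j1)
best_pearson = max(results, key=lambda r: r[1])
best_cosine = max(results, key=lambda r: r[2])
print(best_pearson[0], best_cosine[0])
```

On real data the two criteria need not select the same weight, as the maxima at
w1 = 0.59 (Pearson) and w1 = 0.31 (cosine) show.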

To quantify the possible benefit of additional WL iterations (and hence a larger
space of possible weight vectors), we repeated the calculations with an additional
second WL iteration. We calculated the similarity measure for all weight combinations
(w1, w2) ∈ W × W with w0 := 1 - (w1 + w2) ≥ 0, so that w0 + w1 + w2 = 1. The Pearson
correlation and cosine similarity over the weights w1 and w2 are shown in Figure 2.29 C
and F. The weight combinations that lead to maximum correlations all have w2 = 0
and are therefore the same as in the similarity measure without the second iteration
(Figure 2.29 A and D). Hence, the additional WL iteration provided no benefit for
approximating edit similarity.
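
The admissible weight combinations for two WL iterations form a triangular grid. The
following sketch enumerates it with integer arithmetic to avoid rounding problems at the
boundary w0 = 0; the generator name weight_grid is a hypothetical choice.

```python
def weight_grid(steps=100):
    """Yield all (w0, w1, w2) with spacing 1/steps, w0 >= 0, and w0 + w1 + w2 = 1."""
    for k1 in range(steps + 1):
        for k2 in range(steps + 1 - k1):
            k0 = steps - k1 - k2
            yield k0 / steps, k1 / steps, k2 / steps

grid = list(weight_grid())
print(len(grid))   # number of admissible (w1, w2) combinations for spacing 0.01
```

Combinations with w2 = 0 coincide with the one-iteration weights, which is why the
maxima in Figure 2.29 C and F can be compared directly with panels A and D.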

Overall, we find good agreement between edit similarity and WL similarity for
appropriate values of w1; values around w1 = 0.5 yield both high Pearson correlation
and cosine similarity.


2.6.3.6 Runtime Comparison

To study the practical running time of both similarity measures, two different settings 
were evaluated. In the first setting, all pairwise similarities in a subset of the dataset 
were computed. This setting is a typical sub-task of clustering or machine learning 
tasks on metric distances. In the second setting, the database search application was 
evaluated. Given a set of query complexes and a similarity threshold, similar complexes 
in the dataset were searched. 


Pairwise Similarities We measured the time required to compute all 10 000 pairwise
similarities for a subset of 100 graphs, drawn at random from the dataset described in
Section 2.6.3.3. For the Weisfeiler-Leman-based similarities, the computation can be
divided into two steps. In the first step, the Weisfeiler-Leman feature vectors are
computed for each of the 100 graphs (WL-FV). In the second step, they are used to compute
the similarity values between all 10 000 pairs (WL-SIM). Thus, the computationally costly
part is computed only once for each graph, while the quadratic number of comparisons
is lightweight. This kind of preprocessing is not possible for the graph edit similarity,
where each pairwise similarity computation is costly (GES).

Figure 2.30 shows the violin plot of the measured times for each single feature 
vector and pairwise similarity calculation. Running times are shown separately for 
one and two WL iterations (WL-[FV|SIM]-[1|2]; GES). As expected, the calculation with 
two WL iterations is slower than using only one iteration, but the difference is small. 
Most importantly, we observe that a single graph edit similarity computation is two 
to four orders of magnitude slower than a single Weisfeiler-Leman-based similarity
computation (median of over 10° vs. approx. 10? ns, but extreme outliers are visible 
from the violin plots). While Figure 2.30 shows the times for single-instance calculations, 
Table 2.6 shows the total times for all 100 (FV) and 10 000 (SIM) calculations. For 
Weisfeiler-Leman similarity, we observe that computing the feature sets dominates 
the running time. (This eventually changes if more graphs are considered since the 
WL-FV time grows linearly but the WL-SIM time grows quadratically with the size of the 
dataset.) Overall, the GES computation is slower by more than four orders of magnitude, 
which is influenced by a few extremely slow GES computations. 
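
The overall factor can be checked directly from the totals in Table 2.6. The sketch below
only copies the reported values for one WL iteration and computes the ratio per run; the
variable names are arbitrary.

```python
import math

# Total running times in seconds, copied from Table 2.6 (runs 1 to 3).
ges = [851.222, 809.087, 845.883]
wl_total_1 = [0.050, 0.056, 0.045]
wl_fv_1 = [0.005, 0.006, 0.005]
wl_sim_1 = [0.045, 0.050, 0.040]

# Consistency check: WL-Total = sum(WL-FV) + sum(WL-SIM) in every run.
totals_consistent = all(
    abs(total - (fv + sim)) < 1e-9
    for total, fv, sim in zip(wl_total_1, wl_fv_1, wl_sim_1)
)

# Slowdown of GES relative to the WL similarity, expressed in orders of magnitude.
orders = [math.log10(g / w) for g, w in zip(ges, wl_total_1)]

for run, (g, w, o) in enumerate(zip(ges, wl_total_1, orders), start=1):
    print(f"run {run}: GES / WL-Total-1 = {g / w:,.0f} ({o:.2f} orders of magnitude)")
```

In all three runs the ratio exceeds 10 000, which corresponds to the "more than four
orders of magnitude" stated above.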


Database Search We used the entire set of graphs described in Section 2.6.3.3 as a
database and 500 randomly selected graphs as queries. For the Weisfeiler-Leman-based
similarities, we evaluated a linear scan over the database (wl_linear) and the
min-hashing speed-up described in Section 2.6.2.4 (wl_minhash) with false negative rate 0.01
and weights w0 = w1 = 1/2. Times were measured separately for the Weisfeiler-Leman
feature set calculation, the hash table creation (for wl_minhash only), and the queries
themselves. For the graph edit similarity (ges), a linear scan over the database is
prohibitive and hashing schemes are not readily available. Therefore, we employed the
filter-verification approach described in Section 2.6.2.4 and measured the time for filtering

Fig. 2.30: Running times. Violin plots of wall-clock times for computing each of the 10 000 pairwise
graph edit similarities (GES), the 100 Weisfeiler-Leman sets (WL-FV) and their 10 000 pairwise
Jaccard coefficients (WL-SIM) with either one or two Weisfeiler-Leman iteration(s) in three
independent runs. The time axis is logarithmic.


Tab. 2.6: Total running times for each of three runs in seconds. WL-Total = sum(WL-FV) +
sum(WL-SIM) for either 1 or 2 Weisfeiler-Leman iteration(s). Here sum(·) refers to the sum over the
100 feature vector calculations (WL-FV) and 10 000 similarity calculations (WL-SIM) whose time
distribution is depicted in Figure 2.30.


Time [s]          Run 1     Run 2     Run 3
GES             851.222   809.087   845.883
WL-Total-1        0.050     0.056     0.045
WL-Total-2        0.066     0.071     0.055
sum(WL-FV-1)      0.005     0.006     0.005
sum(WL-SIM-1)     0.045     0.050     0.040
sum(WL-FV-2)      0.008     0.006     0.006
sum(WL-SIM-2)     0.058     0.065     0.049

Fig. 2.31: Running time comparison of the three approaches wl_minhash (purple), wl_linear (green),
and ges (blue) for a database search with 500 queries and a database of ≈ 2 million complexes.


and verification separately. All three algorithms were run with the similarity thresholds 
0.6, 0.7, 0.8 and 0.9 and each run was repeated three times. Further, each run was 
executed using only a single CPU core for better comparability of CPU time usage; in 
practice, many database complexes can be evaluated in parallel, independently of each 
other. The results are shown in Figure 2.31. 

It can be seen that wl_minhash and wl_linear need a similar amount of time for the
feature vector calculation, but the query time is much smaller for wl_minhash because 
only a small fraction of the database complexes needs to be evaluated. These savings 
outweigh by far the additional indexing time required to build the hash tables, even 
for only 500 queries. Therefore, wl_minhash will scale even better to larger numbers 
of queries. The filter-verification approach of the graph edit similarity (ges) is slightly 
faster than wl_linear for thresholds 0.7, 0.8, and 0.9, because using the similarity 
bounds often suffices for a decision and the actual graph edit distance does not need 
to be computed often. For threshold 0.6, however, many verifications and hence ex- 
act computations are necessary; so the running time for ges increases above that of 
wl_linear. 
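
To give an intuition for why min-hashing avoids most comparisons, the following sketch
shows generic MinHash signatures for Jaccard estimation. It is not the scheme of
Section 2.6.2.4: the affine hash family, the CRC32 fingerprint, the expansion of multisets
into sets, and all names are assumptions made for illustration, and the banding of
signatures into hash tables is omitted. Input sets are assumed to be non-empty.

```python
import random
import zlib
from collections import Counter

PRIME = (1 << 61) - 1   # Mersenne prime used as modulus for the hash family

def expand(multiset: Counter) -> set:
    """Turn a multiset into a set by tagging the k-th copy of each element."""
    return {(item, k) for item, count in multiset.items() for k in range(count)}

def make_hashers(num_hashes: int, seed: int = 0) -> list:
    """Draw num_hashes random affine maps h(x) = (a * x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(num_hashes)]

def signature(items: set, hashers: list) -> list:
    """MinHash signature: the minimum hash value of the set under each hash function."""
    base = [zlib.crc32(repr(x).encode()) for x in items]   # stable per-element fingerprint
    return [min((a * x + b) % PRIME for x in base) for a, b in hashers]

def estimate_jaccard(sig_a: list, sig_b: list) -> float:
    """The fraction of agreeing signature entries estimates the Jaccard coefficient."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

hashers = make_hashers(256)
a = expand(Counter({"P1": 2, "P2": 1, "P3": 1}))
b = expand(Counter({"P1": 1, "P2": 1, "P3": 1, "P4": 1}))
estimate = estimate_jaccard(signature(a, hashers), signature(b, hashers))
print(estimate)   # an estimate of the exact multiset Jaccard coefficient 3/5
```

Signatures are computed once per database entry, so a query only needs to be compared
against entries whose signatures collide with its own in at least one hash table.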


2.6.4 Discussion


Our original motivation to consider protein complex similarity was to reduce the size of
the simulation output of our constrained protein interaction network simulator [649]
via clustering, and we were surprised to see that, apparently, no similarity measures
have been proposed explicitly for protein complexes in the literature. Depending on
the underlying representation (set, multiset, or graph), different alternatives suggest
themselves. However, most graph-based measures are both theoretically and practically
hard to compute for larger complexes or for large numbers of complexes. While different
tractable graph similarity measures [705] and an approximate graph edit distance [534]
have been developed, none of these appears to be specifically tailored to the properties
of protein complexes (often fewer than ten vertices; sparse).

Our proposal to define the similarity as a convex combination of two Jaccard coeffi- 
cients (protein label multiset and Weisfeiler-Leman label multiset after one iteration) 
has several beneficial properties. We have shown that it can approximate edit-based 
similarity with high Pearson correlation and cosine similarity, and at the same time, 
that it can be computed much more efficiently. Further, for weight w0 = 1 of the 0-th WL
iteration, WL similarity reduces to the natural similarity measure of the multiset repre- 
sentation. Our framework hence allows for a smooth transition between multiset and 
graph representation. The comparison with an edit-based similarity seems to indicate 
that the protein label multiset plays an important role if one wants to approximate the 
edit similarity, the first WL iteration provides additional graph information, and the sec- 
ond WL iteration does not provide further benefits, probably because most complexes 
consist of few proteins. In addition, in large-scale similarity searches, using Jaccard 
coefficients allows us to efficiently pre-filter for high similarity using locality-sensitive 
hashing. In combination with the preprocessing abilities discussed in the experimental 
running time comparison, this allows for very fast search queries, clusterings, and 
other applications that rely on intensive distance computations. 

In the present work, we have not considered individual similarity between vertex 
labels (i.e., protein types): We treat two labels as either equal or distinct. While it is 
relatively straightforward to allow arbitrary label similarities in the graph edit distance 
framework, and, via Equation 2.25, in the graph similarity framework, this generaliza- 
tion appears less straightforward for WL similarity and will be investigated in future 
work. 

From a biological point of view, a high similarity value between two protein complex 
graphs should indicate a high probability that the complexes share biological functions 
and can (partially) substitute each other in a cellular process. If comprehensive func- 
tional information were available, we could use it for evaluating different similarity 
measures and decide which ones best capture the biological reality. At present, when 
not even the interaction topology of most complexes has been determined, such an 
evaluation is unfeasible, but this may change in the future. 


Abstract: Industry 4.0 is the connecting element of all contributions in this chapter.
The notion of a physical thing moving within an industrial process unites the research
presented. Logistics is a driving force behind Industry 4.0 developments, and the
concept of a cyber-physical twin is one of its fundamental building blocks. This keynote
introduces and develops the mindset for fully appreciating the following contributions.
They range from the creation of a steel bar to autonomous swarms of drones, providing
a wide-ranging selection of current developments.


3.1.1 Introduction 


The term “Industry 4.0” heralds a fourth industrial revolution. The factory of the future 
is built upon a hyper-connected, smart, and autonomous infrastructure. It promises to 
deliver high adaptability with optimal use of resources. Industry 4.0 is expected to bring 
considerable benefits for production sites, factory equipment suppliers, and business 
software providers, on both the productivity and the revenue side. Under Industry 
4.0, digitization penetrates all areas of industrial process chains—from production to 
distribution to recycling and waste management—and is thus spurring expectations 
and ideas regarding the future design and implementation of processes, facilities, and 
systems. 


3.1.2 Industry 4.0 


Amid the wide variety of views on the term Industry 4.0, this keynote adheres to the 
German concept of a Plattform Industrie 4.0. According to this, the term Industry 4.0 
“stands for the fourth industrial revolution, a new stage in the organization and control 
of the entire value chain” and is intended to establish a link to the three previous 
industrial revolutions. Many authors refer to Industry 4.0 collectively as the digital 
transformation of the manufacturing sector. A comprehensive overview of the field is 
given in [279], [548], and [697]. 

As the term revolution suggests, the transformation is likely to present a variety 
of opportunities for an economy that can deal with the disruption that comes with 
profound change. In addition, the successful implementation of Industry 4.0 concepts
and technologies requires the ability to easily change business processes with regard 
to new requirements. This sense of a high adaptability calls for the creation of a kind of 
“deep transparency” to efficiently evaluate decision-making situations. The basic vision 
of Industry 4.0 thus stems from the age-old desire for the constant availability of all 
relevant information about all physical objects at all levels of industrial process chains. 
Technologically, this “deep transparency” is to be achieved through the massive use of 
low-cost sensor technology. To ensure constant availability of information, state-of-the- 
art communication technologies provide the means for both networking the production 
technology and networking between the hierarchical levels of the IT architecture. From 
a business perspective, the hope is that this comprehensive networking will give rise to 
highly adaptable supply-chain networks that can organize and optimize themselves 
and thereby also form the basis for a wide range of business model innovations (cf. 
[372], [189]). 


Concepts 

Industry 4.0 encompasses various concepts that not only cover different technologies, 
but also affect various structural aspects of the organization within a company and 
between companies. 

The constant availability of information requires a smooth exchange of data within 
the entire value creation network through the integration of physical objects and IT 
systems. This gives rise to the concepts of vertical and horizontal integration. Vertical 
integration describes the creation of a coherent network for objects and systems within 
a company. All internal IT systems are interconnected via harmonized interfaces that 
serve to exchange data between a single sensor, a production machine, or the production 
planning system. Horizontal integration takes place across the entire value creation 
network. The vertically integrated IT systems of customers, suppliers, or the company’s 
own distributed sites are integrated into a horizontal system landscape. This enables 
the exchange of information in real-time across company boundaries (cf. [310], [548]). 

While the goal of constant availability of information can theoretically be achieved 
through a centralized approach, Industry 4.0 explicitly recognizes the geographically 
distributed nature of the value creation network. This is covered by the concepts of 
decentralized control and autonomous behavior of physical objects. These concepts 
encompass two important aspects. On the one hand, the disruptive dissolution of
centralized, rigidly planned control systems is necessary simply from the point of view
of transmission and computing power, due to the large amounts of data that are to be
delivered in real-time. On the other hand, decentralized control enables the use of
autonomous, automated decision support systems. Based on the available data, these
systems should support human decisions or make decisions partially autonomously (cf. [617]).

The digitization of all physical objects is reflected in the concept of Cyber-Physical 
Systems (CPS). These are created by combining a purely physical system with computing 
power, memory and a communication interface. In the context of Industry 4.0, this
combination enables the creation of smart products, production tools and machines as 
part of the production process. CPS are expected to generate very large amounts of data, 
which in turn requires the on-demand availability of decentralized, high computing 
capacity and methods for processing and analyzing the information. 

Several CPSs are the building blocks of the concept of Cyber-Physical Production 
Systems (CPPS). These are production systems that use CPSs and newly defined in- 
terfaces between machines and between humans and machines to accomplish the 
production task. CPPS are designed to enable decentralized, semi-autonomous, cross- 
enterprise production control (cf. [437], [617]). 

Another concept related to Industry 4.0 is end-to-end digital engineering. It encom- 
passes the digital mapping of the entire physical production process in a company, 
from product development to product completion. All planning, control and monitoring 
processes are constructed and simulated in a virtual environment. The basis is a digital 
image of the factory with all its physical objects. This includes production facilities, 
personnel, products and other working and operating resources that are present in the 
real factory (cf. [617]). 


Maturity Levels 

The idea of Industry 4.0 is often reduced to a visionary description that represents an
already fully realized implementation of all concepts. Although almost all the
technologies that would be necessary for successful implementation are already available, it
is often only the right combination of approaches and solutions that brings positive
results for a company.

The “Acatech Industrie 4.0 Maturity Index” describes a maturity-based development
path towards a fully developed Industry 4.0 capability (cf. [584]):

— The first two development levels are not seen as part of Industry 4.0 per se, but
form the basis for all further levels. Computerization refers to the isolated use
of information technologies, while connectivity refers to the networking of these
isolated systems. These two terms denote the first and second levels of the maturity model.

— The third level describes the ability to perceive through extensive equipment of 
physical objects with sensor technology. This enables the comprehensive collection 
of data points about interrelated processes. The aim of this stage is to generate the 
ability to create an up-to-date digital model of reality at any time. 

— The fourth development stage describes a state of transparency in which, building 
on the third development stage, interrelationships and causes become comprehen- 
sible through the analysis and interpretation of these interrelationships. 

— The fifth level describes the presence of forecasting capability. For this purpose, 
the system is equipped with the ability to simulate dynamic processes. On this 
basis, automated simulation experiments can be carried out that depict different 
future scenarios. The evaluation of the experimental results leads to the forecasting
capability of the system. 

— At the sixth level, adaptability, the forecasts are used to derive decisions
independently and implement suitable measures automatically.


These six developmental levels represent a development path for the Industry 4.0 
capability of a company (see Figure 3.1). In addition, the development levels provide a 
reference point for classifying existing Industry 4.0 technologies (cf. [584]). 

Fig. 3.1: Maturity Index for Industry 4.0.


3.1.3 The Role of AI in Industry 4.0


From the perspective of Industry 4.0, AI technologies can be used to enable a company
to reach the higher levels of the maturity index. AI technology is intended to enable
technical systems to autonomously perceive their environment based on the data
available to them. In particular, it can be used to locate and identify physical objects
and determine their state. The subsequent interpretation of the perceived objects and
the meaning of their state, as well as the relationships between the objects, should help
to achieve the necessary transparency for reasoning about industrial process chains. The
automatic creation and updating of simulation models for an automated continuous
prediction system, as envisioned for the fifth maturity level, could be supported using
certain AI methods such as imitation learning.

While AI is sometimes understood in its function as a technological form of human
decision-making capability, in the context of Industry 4.0 it is not necessarily intended
to directly copy human behavior. On the one hand, AI is intended to achieve the classic
goal of automating industrial processes: cost reduction, time savings, quality assurance, 
etc. On the other hand, AI technologies are expected to manage the emerging complexity 
of the vast amounts of data coming in from the newly deployed Industry 4.0 sensor 
systems. To some extent, AI systems may solve problems, discover unexpected issues, 
or reduce complex situations by uncovering associations in the data hidden to the 
human eye. 


The Data Crisis 
However, AI technologies are not “magic fairy dust”, but are limited by the underlying 
data and mathematical models used to try to solve a specific problem [377]. 

One of the most important engineering issues for the successful use of AI technology 
in Industry 4.0, especially machine learning methods, is therefore the management 
of the underlying data collection and its representation in a database. Generating an 
error-free and uninterrupted data stream is a prerequisite for successful operation. 
If the data is inconsistent and of poor quality, the digital images of physical objects 
cannot represent the true physical state. Predictions and decisions made on the basis 
of such a virtual model would then also be of poor quality. 

The design of any Industry 4.0 data stream begins with developing a fundamental, 
ontological notion about the true nature of the physical object in question. Critical 
factors are the physical attributes of the object and the selection of sensors as well as 
the programming of the transformation of sensor data into semantically correct obser- 
vations. The overarching challenge is that in many places the tools to adequately assess 
these aspects are still lacking. In particular, in many cases 80 % of the time building a 
system is spent on data collection and cleaning, a situation that has been called the data 
crisis [377]. “The main reason for the data crisis is the increasing interconnectedness 
of computers. Access to data is therefore easy and cheap, but its quality is often poor. 
What we need are cheap data streams of high quality. This means that efficient methods 
for improving and verifying data quality must be developed.” [377] This crisis becomes
even more apparent when considering continuous data acquisition by a multitude 
of sensors, as envisaged for Industry 4.0 systems. Challenges include monitoring the 
quality of the data stream throughout the life cycle of an Industry 4.0 system and the 
ability to respond appropriately to changes in the underlying assumptions made at the 
beginning of the design process. 


3.1.4 From Digital to Cyber-Physical Twin


In response to the data crisis, the development of a system should more closely link
the design of data acquisition with the design of the physical objects concerned. This
results in the concept of the digital twin, which combines the connectivity of a CPS
with the virtual representation of objects in end-to-end digital engineering. The concept
of the digital twin focuses in particular on the quality of data transmission between the
physical and the virtual counterpart.

Like the Industry 4.0 maturity index for companies, the literature on digital twins
distinguishes different implementation levels for the digitization of physical objects [354].
The development stages are differentiated according to the degree of automation:

— A digital model has a purely manual data transmission. It can be, for example, a 
simulation model of a planned factory, a mathematical model of a new product or 
another model of a physical object that is stored digitally. 

— A digital shadow automates the data flow from the physical to the virtual
object. Any change in the state of the physical object is transferred and applied to
the state of the digital object.

— The digital twin eliminates all manual data transfers. It extends the concept of 
digital shadow by automating the flow of data directed to the physical object. 
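
The three levels in the list above differ only in which direction of the data flow is
automated. A minimal sketch of this classification follows; the enum, the function name,
and the treatment of the fourth combination, which the list does not cover, are
assumptions made for illustration.

```python
from enum import Enum

class IntegrationLevel(Enum):
    DIGITAL_MODEL = "digital model"
    DIGITAL_SHADOW = "digital shadow"
    DIGITAL_TWIN = "digital twin"

def classify(physical_to_virtual_automated: bool, virtual_to_physical_automated: bool):
    """Map the automation of both data-flow directions to one of the three levels.

    Returns None for the combination not covered by the three levels
    (automated flow to the physical object but manual flow from it).
    """
    if physical_to_virtual_automated and virtual_to_physical_automated:
        return IntegrationLevel.DIGITAL_TWIN
    if physical_to_virtual_automated:
        return IntegrationLevel.DIGITAL_SHADOW
    if not virtual_to_physical_automated:
        return IntegrationLevel.DIGITAL_MODEL
    return None

print(classify(True, False).value)   # digital shadow
```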


In order to be able to map data automatically onto the virtual object, correct
identification of the physical object is a prerequisite for both the digital shadow and the
digital twin. In addition, the complete elimination of manual data transmission makes
the design of data reception at the physical object particularly relevant. Necessary
conditions for the digital twin can be described as follows:

— The data of the virtual object must be able to be received and processed automati- 
cally at the physical object. 

— A physical object can only be part of a digital twin if it can be perceived separately 
from other objects. 

— A perceived physical object must be uniquely identifiable so that the captured 
sensor data can be assigned to the correct virtual object. 


The concrete technical implementation of the identification can vary. It can be done by 
observation, e.g. by a scanner reading 2D barcodes. It can also be done by communica- 
tion, e.g. via radio transmission, where recognition and identification result from the 
defined standards of radio transmission, i.e. how receiver and transmitter can identify 
each other is defined by communication protocols. 

Many of today’s Industry 4.0 applications can be classified under the concept of 
the digital shadow. One example is a low-cost tracker that is built into the bottom of 
a pallet and records data about the pallet’s current location, movement, impacts and 
temperature profile. This data is sent over the cellular network to a server where the 
corresponding digital shadow is located (cf. [354]). 


Cyber-Physical Twin

Due to the very general metaphor, digital twins can vary greatly in size and complexity.
Therefore, the application areas and the possible types of digital twins can be very
broad. In particular, there is no distinction between individual physical objects and
large, complex systems. The term digital twin can describe a system that includes only
a small object or an entire factory. In the case of a large system such as a factory, a
central, monolithic approach in the form of a central database would reach its obvious
technical limits.

For use in an Industry 4.0 environment, a special class of digital twins is therefore 
required that provides for a decentralized, modular architecture based on individual 
physical objects. Their networking should enable scalable, adaptable systems that
support autonomous behavior. This new extended concept can be called a cyber-physical
twin. It has the following additional characteristics:

— The physical object is considered to be a self-contained and relatively small entity. 
Its physical extent can be confined to a finite space, and it has a definite location. 
It is fundamentally mobile, even if it remains in a particular place for a long time. 

— The transmission path to the physical object is primarily for changing its behavior 
rather than issuing direct commands. All behavioral changes are first made to the 
digital object, with the physical object adjusting its internal behavior accordingly. 
The behavioral changes can be made by transmitting parameters, compiled source 
code, or other behavioral data (e.g., a trained neural network). 

— The physical object acts largely autonomously and makes local decisions where 
possible. 

— The physical object monitors its environment and can trigger an update process 
in the virtual image when situations arise in the physical world that require it to 
adjust its behavior. 

— Cyber-physical twins are designed as multi-agent systems that communicate in
both the physical and virtual worlds. 
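
The characteristics listed above can be illustrated with a minimal sketch. It assumes a hypothetical pair of classes, `VirtualObject` and `PhysicalObject`, and a hypothetical behavior parameter `max_temperature`; none of these names come from the chapter. The sketch only shows the direction of change: behavior is modified on the digital object first, the physical object then adopts it over its exclusive link and decides locally.

```python
from dataclasses import dataclass, field


@dataclass
class VirtualObject:
    """Digital side of the twin: holds the authoritative behavior parameters."""
    object_id: str
    behavior: dict = field(default_factory=dict)
    version: int = 0

    def change_behavior(self, **params):
        # Behavioral changes are always made to the digital object first.
        self.behavior.update(params)
        self.version += 1


@dataclass
class PhysicalObject:
    """Embedded side of the twin: acts on a local copy of the behavior."""
    object_id: str
    behavior: dict = field(default_factory=dict)
    version: int = -1

    def synchronize(self, virtual: VirtualObject) -> bool:
        # Exclusive link: only the matching virtual object is accepted.
        if virtual.object_id != self.object_id:
            raise ValueError("virtual object belongs to a different twin")
        if virtual.version == self.version:
            return False  # nothing new to adopt
        self.behavior = dict(virtual.behavior)
        self.version = virtual.version
        return True

    def decide(self, measured_temperature: float) -> str:
        # Local, autonomous decision based on the current parameters.
        limit = self.behavior.get("max_temperature", float("inf"))
        if measured_temperature > limit:
            return "request_update"  # situation the current behavior cannot handle
        return "continue"


virtual = VirtualObject("bin-17")
physical = PhysicalObject("bin-17")
virtual.change_behavior(max_temperature=8.0)
physical.synchronize(virtual)
```

In this sketch the physical object never receives a direct command. It only receives parameters, which matches the idea that the transmission path is used to change behavior rather than to control individual actions.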


Figure 3.2 shows the concept of two cyber-physical twins negotiating with each other 
in the physical and virtual environments. Note that communication between the envi- 
ronments takes place via an exclusive link between the physical and virtual objects to 
ensure synchronization of the twins. 

Fig. 3.2: Cyber-physical twin concept


Although not all cyber-physical twins are controlled by AI, the challenges for software 
and hardware engineering are similar, as high-quality data streams are required in 
both cases. 

The mobility of the physical object is required for the adaptability of an Industry 
4.0 system, as it can then change the arrangement of its components. It follows that the 
embedded computer system of the physical object is in principle resource-constrained, 
while the computer system in which the virtual object runs is considered unconstrained. 


3.1.5 An Excursion into Logistics 


Since the introduction of Industry 4.0, the area of logistics has been considered its 
outstanding application domain. In no other area of industry is such a fundamental 
change expected in the near future. On the one hand, many of the central technical 
and social challenges are directly or indirectly linked to logistics and efficient supply 
chain management. On the other hand, the expected change is driven by the rapid development of Industry
4.0 technologies. In addition to global data processing capabilities, resource-efficient 
sensor hardware, communication technologies, and embedded systems are increas- 
ingly usable on a large industrial scale. This enables widespread rollout at the level of 
individual logistics objects (cf. [164]). 

In the context of an increasingly volatile production and trade environment, the 
topology of logistics networks and thus the location of an individual logistics node, such 
as a transshipment point or a distribution center, can no longer be permanently fixed. 
In fact, the idea of a fixed, ideal location has not been viable for many years. A logistics 
network and its nodes must constantly adapt to new circumstances. Therefore, logistics 
centers should be able to relocate on their own in the future. This rules out many forms 
of traditional, technical infrastructures and underscores the need to introduce new 
concepts such as the cyber-physical twin (cf. [279]). 

To an increasing extent, swarms of autonomous vehicles will take over intra- 
company transport. In production systems, the arrangement of workstations can thus 
be changed at any time. The vehicles’ virtual software agents negotiate orders and
business processes, while the physical software agents negotiate movement paths and 
constantly exchange their locations and those of new stations or storage locations. 
Autonomous vehicles are able to approach the racks and move bins or pallets in and 
out. There is virtually no stationary conveyor system in this vision. 


Fig. 3.3: Cyber-physical twins managing warehousing tasks in a supply-chain 


Even the shelf and each bin within it can become part of a cyber-physical twin. The 
bins in the warehouse handle inventory management, check minimum stock levels, 
and order replenishment (see Figure 3.3). They communicate with shelf displays and 
transport vehicles. The classic, RFID-based “Internet of Things,” as it was devised at 
the turn of the millennium, is literally getting eyes, ears, arms, and legs. 

Direct challenges quickly arise from such all-encompassing networking of physical 
and virtual objects. Although the absolute amount of data has been increasing signifi- 
cantly in all areas of the economy for a long time, the massive increase in the collection 
and processing of logistics process data is becoming one of the central challenges to be 
mastered. The potential for AI applications in cyber-physical twins handling complex 
logistics processes is significant. From predicting arrival times in transportation and 
production logistics to dispatching logistics networks and the coming high-frequency 
logistics, distributed AI solutions can automate and autonomize complex processes 
that were previously closed to classic control methods. It is expected that AIs will be
able to learn how to cope with the complexity of a large number of cyber-physical twins 
representing logistics objects. 


3.1.6 Resource-Constrained Smart Objects That Can Move


The contributions in this chapter address four key areas associated with physical ob- 
jects in Industry 4.0 systems: the creation of things, their localization, the resource 
constraints of objects once they become mobile and smart, and their behavior with 
respect to each other. 

The first three contributions deal with the genesis of a thing long before it be- 
comes intelligent or autonomous. Using the example of steel production, machine 
learning methods are used to investigate how the act of physical separation creates 
a self-contained entity from a primordial mass and how this separated thing can be 
assigned to a concrete class by determining or predicting certain properties. An early 
prediction of undesirable developments enables a controlling intervention in the pro- 
duction process (Section 3.2). The second contribution is dedicated to challenges for 
machine learning techniques that arise in the context of changing, physical aggregate 
states (Section 3.3). Combining expert knowledge and collected data in simulation is 
the topic of the third contribution (Section 3.4). 

The fourth contribution deals with the automated localization of things—a basic 
prerequisite for a continuously (self-)perceiving infrastructure that supports flexible 
(self-)control of processes. Industry 4.0 thus requires a comprehensive ability to observe 
physical object movements. Wireless ultra-wideband localization is an example of the 
technical solution to this task in indoor environments (Section 3.5). In this case, the 
radio medium is the limiting resource that affects the accuracy and scalability of the 
application. The contribution is related both to the smart city concept (see Chapter 4), 
which deals with a cross-location infrastructure based on future wireless technologies, 
and to communication networks (see Chapter 5). 

The fifth contribution is motivated by smart things that can move. Free mobility is 
at its core a deeply logistic property (“the ideal logistic space is empty”). It is also the 
main cause of resource constraints in terms of energy supply, computational capability 
and communication for embedded systems. The topic of “Indoor Photovoltaic Energy 
Harvesting” (Section 3.6) deals in depth with the requirements for ultra-low power 
devices and their modeling. An important application of this technology is the stan- 
dardized logistics container, which becomes smart by an embedded ultra-low power 
computer system. Such smart containers are destined to be one of the key building 
blocks of Industry 4.0: they are the external interface for the non-smart things that they 
contain. 

The last contribution deals with the behavior of smart and in principle au- 
tonomously moving things. It is based on the development of a micro-UAV drone 
swarm (Section 3.7). The resource constraints of the small drones place high demands 
on the system architecture. The contribution describes the setup of a testbed envi- 
ronment in which simulation and the physical world are tightly coupled. The drones 
negotiate their individual movements through simple rules that mimic natural swarm 
behavior. It is shown how individual rules can be replaced by knowledge learned in 
a simulation. The drone swarm testbed illustrates the challenges of developing
autonomous swarm systems for Industry 4.0. Since these systems will increasingly be used
across companies in business-critical areas in the future, the issue of data protection 
in multi-agent systems will also become important in this context (cf. Section 6.1). 


Abstract: Interlinked manufacturing processes are characterized by the dependence of
downstream process steps on the previous ones. If it can be predicted that a particular
workpiece will not reach the desired quality, anticipatory measures can be taken early
in the process. Through such predictions, machine learning saves resources, both in
processing and in material. One example is the model-based quality prediction in
electronics manufacturing on a Surface Mount Technology (SMT) production line. Here, the
application of a learned classifier predicting the quality must be fast so that, say, the
routing of a piece may be changed. Hence, machine learning itself needs to save its
resources, in this case runtime. Another example is the hot rolling mill process for steel
bar production. There, several sensors and process parameters deliver process data
that need to be aligned, and useful features must be extracted from the resulting stream
automatically, in real time. Here, machine learning saves testing time in the factory’s
quality inspection process.


3.2.1 Introduction 


Undetected quality deviations that pass through the entire manufacturing chain have
a severe impact on internal failure costs, because defective products that were not
identified as defective earlier are increasingly rejected or reworked. Early quality
prediction for a specific workpiece therefore indicates whether it will meet the
quality requirements or whether anticipatory measures should be taken in time
to save resources (e.g. time, material) that would otherwise be lost to rejection or
further processing.
However, due to technological and temporal restrictions, physical product quality in- 
spections are limited to the final process step. In this context, data mining and machine 
learning techniques can be used to predict the intermediate product’s quality, thus 
gaining transparency on quality properties of intermediate process steps and enabling 
real-time process adaptation to sustainably increase its efficiency [339, 654]. 

In general, process modeling can be done on three levels [654]: process under- 
standing; designing better processes and equipment; and online process control in 
real time. Online control requires integrating data from different sensors at different 
steps in the process, taking into account the communication between sensors and even 
integrating different models in real time. Over the last decades, the majority of works
focused on the first two levels, where models based on domain knowledge
have been developed. More recently, with the emergence of the “industrial Internet
of Things” or “Industry 4.0” (see Section 3.1), the adoption of monitoring
technologies based on sensors [155, 156] has offered new opportunities for
applying data-driven approaches for process modeling, in general, and for intermediate 
quality prediction, in particular. 

This section is about performing Intermediate Quality Prediction (IQP) along a process
chain by embedding data analysis directly into the manufacturing chain. We start by 
showing how IQP using data mining and machine learning can be integrated into 
a comprehensive Intelligent Manufacturing Process Control (IMPC) framework for 
industrial applications. Section 3.2.3 dives deeply into the steps of data analysis. These 
steps include a detailed description of the data acquisition process, cleansing, the 
choice of data representation, extracting the right features, modeling, and evaluation.
A framework for processing data streams with the right level of abstraction in real-time, 
in addition to the real-time management of many machine learning models, is also 
presented. It glues diverse contributions of data mining and modeling together to form 
an application. 

We present two real-world case studies in Section 3.2.4, starting with the description 
of the production process and data acquisition followed by modeling and deployment. 
The first case study addresses the use of data mining in an electronics production 
environment for the purpose of reducing the quality inspection volume. The case study 
is conducted on a Surface Mount Technology (SMT) manufacturing line in the Siemens 
plant in Amberg. The motivation is to relieve the optical end-of-line test, consisting of an 
X-ray inspection system. The second case study consists of a hot rolling mill process. It 
showcases the importance of embedded data analytics. The system is developed in close 
collaboration with experts in machining, production, and the steel mill. Data analysis 
results are validated by domain experts. The conclusion summarizes our findings and 
gives an outlook for future research work. 


3.2.2 Intermediate Quality Prediction in Intelligent Manufacturing Process Control 


Different quality-related tasks for the application of data mining and machine learning 
in manufacturing can be distinguished [335]: description of product/process quality; 
modeling of the product quality; quality prediction; and parameters/process optimiza- 
tion. 

In literature, there are several standardized procedures for the implementation of 
data mining or machine learning in general and one approach for optimizing quality- 
related tasks in particular. Four widely used process models include the Knowledge 
Discovery in Databases (KDD), the Cross-Industry Standard Process for Data Mining 
(CRISP-DM), the 5-step SEMMA model by SAS, and the 5A model by SPSS [121, 201, 484, 
668]. The general framework for continuously monitoring and optimizing a process
under quality considerations is known as IMPC [339]. A characteristic feature of the 
models mentioned is that they are very similar in terms of their basic objectives and 
structure. The procedures are subdivided into distinct iterative phases that differ in 
terms of the number and the content of the respective phases. While the IMPC solely 
focuses on the technical perspective of a data mining project, some other models include 
a phase for elaborating the business cases, which indicates the necessary involvement 
of expert knowledge in every data science project. 

Product or process quality description is usually the first step to be performed, 
especially in the context of highly complex manufacturing systems with non-linear 
interactions between the different process steps [579]. This model can be based on a
physical model designed by domain experts, or it can be a data-driven model using 
machine learning techniques. The model is subsequently used to predict the quality of 
new unseen products. Following these predictions, a variety of measures for process 
optimization can be applied [405, 579, 654]. These measures include early control 
interventions, optimization of process parameters setting, stabilization of processes, 
dynamization of inspection plans, and the design of model-based inspection processes. 

The IMPC introduces data mining techniques to perform IQP and adjust further 
production processing steps according to the results of the IQP. The IMPC consists of 
different functional modules that can be summarized as follows [339]: 

1. Data acquisition and storage 
2. Intermediate Quality Prediction-IQP 
3. Process optimization 


While the first module seems to be a prerequisite for building the IMPC, modules 2-3 
represent the different process control stages. 

The IMPC itself can be realized following two different paradigms. The first one is 
based on an online optimization of the process based on the current observed state of 
the process in order to compensate for previous process deviations. Such a decision is 
made following the domain knowledge provided by production experts and engineers. 
The second one is a data-driven approach where the first and second modules are 
integrated into the decision support system. The second type of process control relies 
on estimating the quality of the intermediate product in real time. Anticipatory measures 
are taken to prevent possible predicted process deviations. The modular design of the 
entire process control approach is detailed in Figure 3.4. 

The IMPC concept can be viewed as a separate building block in a company’s 
process control landscape [339]. The IMPC does not introduce any change in the process 
or production structure. However, it generates recommendations in process planning 
and optimization. Therefore, the focus is brought on analyzing processing states in real 
time and forecasting the intermediate product’s quality properties. This knowledge is
used to derive recommendations on whether the product should be processed any
further, to stop the processing of the product early because the final quality
requirements will not be met, or to optimize the next processing step’s parameters
to align the product with the required quality standards.


3.2 Quality Assurance in Interlinked Manufacturing Processes —— 117


Fig. 3.4: Intelligent manufacturing process control model. [Figure labels: Data Acquisition and Storage; Intermediate Quality Prediction-IQP; Data preprocessing and feature extraction; Model building; Process optimization; Process parameters optimisation.]

The IMPC requires information on the current processing state and parameters that 
are gained from the first module, in addition to information on historical processing 
data. Data is most often collected from sensors implanted to monitor some process 
variables. Collected data has to be preprocessed and meaningful features should be 
extracted to be fed afterwards into a quality prediction model in order to assess the 
intermediate product’s quality correctly. Hence, the Intermediate Quality Prediction
(IQP) module translates all available information irrespective of its actual meaning into
a quality assessment. This is done by means of data mining, i.e., supervised learning 
models [339, 654]. The result of the IQP module can then be applied to decision rules or 
recommendations on process optimization. The process optimization module is the 
final step in implementing the IMPC model. As described above, it aims at adjusting 
the parameters of upcoming process steps in such a way that quality deviations caused 
by previous processing steps are compensated in the remainder of the process chain. It 
is worth mentioning that when it comes to decision rules following the results of the 
prediction of intermediate products’ quality, even more knowledge is required, includ- 
ing domain knowledge and artificial test-beds (e.g. process simulations) to validate the 
correctness and feasibility of the process optimization recommendations. 
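
The interplay of the IQP module and the process optimization module can be sketched as a single decision step. This is a minimal illustration, not the IMPC implementation of [339]: the function name `impc_step`, the two thresholds, and the toy predictor are assumptions made for the example, and a real deployment would derive the decision rules from domain knowledge and validate them, e.g. in process simulations.

```python
from typing import Callable, Sequence


def impc_step(
    features: Sequence[float],
    predict_nok_probability: Callable[[Sequence[float]], float],
    stop_threshold: float = 0.9,     # hypothetical value
    adjust_threshold: float = 0.5,   # hypothetical value
) -> str:
    """Turn an intermediate quality prediction into a process recommendation."""
    # IQP module: a learned model maps the extracted features to a
    # probability that the final product will be Not Ok (NOk).
    p_nok = predict_nok_probability(features)

    # Process optimization module: simple decision rules on the prediction.
    if p_nok >= stop_threshold:
        return "stop_processing"              # final quality will not be met
    if p_nok >= adjust_threshold:
        return "adjust_next_step_parameters"  # try to compensate deviations
    return "continue"


def toy_model(features: Sequence[float]) -> float:
    """Stand-in for a trained classifier: clipped mean of the features."""
    return min(1.0, max(0.0, sum(features) / len(features)))
```

Passing the predictor in as a callable keeps the decision rules separate from the model, so that either can be replaced without touching the other.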

In the following sections, we will discuss in detail available methods for data 
analysis and quality modeling using machine learning. 


3.2.3 Methods for Data Mining from Sensor Data 


The dynamic evolving nature of manufacturing processes poses multiple challenges 
for data mining from sensor data [654]. For instance, different scenarios may occur, 
sensors and machines may fail, some deviations in processing steps could be observed 
and products may take different routes through the process chain. With such events,
difficult challenges arise regarding how data should be collected, stored, and preprocessed.
Additionally, adaptive extraction and selection of the features that are relevant for
the prediction task are necessary. Moreover, to generate intermediate products’ quality
predictions in real time, stream data mining methods and online management of many 
machine learning models should be established. The following sections will discuss 
such problems and available methods in more detail. The structure of the section is 
depicted in Figure 3.5. It is based on the IQP process steps and comprises standardized 
procedures that are common steps in every quality-related data-mining task. 

Fig. 3.5: Specification of the IQP process with regard to common steps in quality-related data mining
tasks. 


3.2.3.1 Data Acquisition and Storage 

In intelligent manufacturing systems, data is most often collected using various sensing 
technologies [339, 405, 654]. The continuous measurements of sensors form an indexed
stream of countably infinite data items $x_i$. Every data item contains an index, e.g., a
timestamp, and can contain an arbitrary number of values and value types such as
images, strings, or numbers. A segment of this stream with length $n$ and only numerical
values forms a value series that can be defined [445] as a mapping
$x : \mathbb{N} \to \mathbb{R} \times \mathbb{C}^m$, $m \in \mathbb{N}$,
where each element $x_i$ of a value series with length $n$ is an ordered pair $(d_i, w_i)$.
$d_i \in \mathbb{N}$ is called the index component and $w_i \in \mathbb{R} \times \mathbb{C}^m$
the value component. It is called a
time series if the index dimension represents a temporal order. The value component
is written using a complex number value space instead of a real number value space
in order to formalize sensor readings and their transformations in the same form. In
classification tasks (e.g. quality prediction) the goal of a machine learning model is to
learn a functional mapping $f(x) : \mathbb{R} \to y : \mathbb{N}^n$, where $y$ formally denotes the class label
of the process measurements. In general, this task is named a multi-class classification 
task. The special case of binary classification occurs when the class label consists of 
only two complementary classes, say, the Not Ok (NOk) class and the Ok class. 
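
As a rough illustration of these definitions, a value series can be held as a list of (index, value) pairs. The sketch below is an assumption-laden simplification: the names `make_value_series`, `is_time_series`, and `toy_binary_label` are hypothetical, the value component is stored as a plain tuple of complex numbers, "temporal order" is interpreted as strictly increasing indices, and the threshold rule merely stands in for a learned classifier.

```python
from typing import List, Sequence, Tuple

# One element x_i = (d_i, w_i): integer index and a tuple of complex values.
Element = Tuple[int, Tuple[complex, ...]]
ValueSeries = List[Element]


def make_value_series(indices: Sequence[int],
                      readings: Sequence[Sequence[float]]) -> ValueSeries:
    """Pair every index d_i with its value component w_i."""
    if len(indices) != len(readings):
        raise ValueError("indices and readings must have the same length")
    return [(int(d), tuple(complex(v) for v in w))
            for d, w in zip(indices, readings)]


def is_time_series(series: ValueSeries) -> bool:
    """Assumption: a temporal order means strictly increasing indices."""
    idx = [d for d, _ in series]
    return all(a < b for a, b in zip(idx, idx[1:]))


def toy_binary_label(series: ValueSeries, limit: float) -> int:
    """Stand-in for a learned mapping f: 1 = NOk, 0 = Ok."""
    exceeded = any(v.real > limit for _, w in series for v in w)
    return 1 if exceeded else 0


series = make_value_series([1, 2, 3], [[20.0], [21.5], [19.0]])
```

Real sensor readings simply have a zero imaginary part here. Transformed values, such as Fourier coefficients, fit into the same structure without a second data type.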


3.2.3.2 Data Preprocessing 

In manufacturing environments, sensors may deliver wrong readings or might experience
failure periods [474]. Their readings are noisy most of the time. Hence,
collected data may contain irrelevant readings, be wrongly aligned, or have different
resolutions. In an offline setting, such cases can simply be excluded from the analysis 
or corrected. By contrast, the embedded real-time analysis of data must somehow de- 
tect such cases in an automated fashion and react accordingly. The first analysis step 
therefore usually consists of cleaning the sensor data. 


Detection and Handling of Faulty Sensor Readings Faulty sensor readings such as 
sensor readings lying outside physically meaningful ranges need to be detected. Nev- 
ertheless, faulty readings may overlap with the normal data, requiring the automatic 
detection of faulty patterns using supervised learning techniques (see Section 3.2.3.5) 
[654]. However, if the faults are not highly frequent, it is difficult to detect them based 
on available training examples. Models for anomaly detection can be of great help in 
this context by describing only the normal data. They mark patterns deviating largely 
from the distribution as anomalies. Nevertheless, the correct definition of parameters, 
like threshold values, remains difficult with only a few noisy examples. Furthermore, it 
can be difficult even for domain experts to identify such examples correctly. There are
several ways to handle missing values once they have been detected, including
replacement by their predecessor values, auto-regressive moving average approaches
[98], or imputation based on predictive models that are trained on other existing values.
Additionally, the production process itself might introduce some level of noise
to sensor data. If the underlying noise model is known, it should be used. Otherwise, 
measurements should be filtered and smoothed [405, 654]. 
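
A minimal cleaning step along these lines might look as follows. The sketch assumes a hypothetical function `clean_readings` with a known physically meaningful range `[low, high]`. It uses replacement by the predecessor value for missing or out-of-range readings and a running median as a generic smoothing filter for the case where no noise model is known. These are illustrative choices, not the methods used in the case studies.

```python
from statistics import median
from typing import List, Optional, Sequence


def clean_readings(values: Sequence[Optional[float]],
                   low: float, high: float, window: int = 3) -> List[float]:
    """Range check, predecessor replacement, and running-median smoothing."""
    # 1) Mark readings outside the physically meaningful range as missing.
    marked = [v if (v is not None and low <= v <= high) else None
              for v in values]

    # 2) Replace missing readings by their predecessor value. Leading gaps
    #    are filled with the first valid reading (an assumption).
    first_valid = next((v for v in marked if v is not None), None)
    if first_valid is None:
        raise ValueError("no valid reading available")
    filled, last = [], first_valid
    for v in marked:
        if v is not None:
            last = v
        filled.append(last)

    # 3) Smooth with a centered running median; windows shrink at the edges.
    half = window // 2
    return [median(filled[max(0, i - half): i + half + 1])
            for i in range(len(filled))]


raw = [20.0, 21.0, 999.0, 22.0, None, 23.0]
cleaned = clean_readings(raw, low=0.0, high=100.0, window=3)
```

In an embedded real-time setting, the same three steps would be applied incrementally to each arriving reading instead of to a complete list.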


Detection and Handling of Changes Deviations and changes in sensor readings 
may also be explained by intended changes in the underlying hardware, such as when 
new production equipment and new or different calibrations of sensors are introduced.
Based on the type of changes, it must be decided which of the trained models need to be 
updated and how. Without any models describing such changes, methods for concept 
drift detection can be applied and it has to be decided if already trained prediction 
models must be updated and how these models should be managed [564]. More details 
are provided in Section 3.2.3.10. Incremental training methods, like streaming methods 
(see Section 3.2.3.9), can be used to integrate such changes into the trained models 
automatically. 
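
To make the idea of change detection concrete, the following sketch compares a recent window of readings against a reference window. It is a deliberately simple mean-shift check, not one of the concept drift detectors from the literature cited above; the function name `mean_shift_detected` and the threshold of three standard errors are assumptions.

```python
from statistics import mean, pstdev
from typing import Sequence


def mean_shift_detected(reference: Sequence[float],
                        recent: Sequence[float],
                        threshold: float = 3.0) -> bool:
    """Flag a change if the recent mean is far from the reference mean."""
    ref_mean = mean(reference)
    ref_std = pstdev(reference)
    if ref_std == 0:
        # Constant reference: any different recent mean counts as a change.
        return mean(recent) != ref_mean
    # Distance of the recent mean in units of its standard error, assuming
    # the recent readings follow the reference distribution.
    standard_error = ref_std / len(recent) ** 0.5
    z = abs(mean(recent) - ref_mean) / standard_error
    return z > threshold


reference = [10.0, 10.2, 9.8, 10.1, 9.9, 10.0]
```

A positive result does not say whether the change is a fault or an intended recalibration. That decision, and the decision of which models to update, still needs process knowledge.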


Different Alignments and Calibration Different data alignments can be observed 
due to different data sampling strategies and sampling frequencies from sensor record- 
ings. The measurement frequency of a data source is defined by the number of times the 
source delivers data per time unit. The higher the desired sampling frequency, the more 
memory it uses. The association and integration of data from many sensors require 
their synchronization with the environment description. The environment description 
is characterized by the adequate time of measurement. To obtain a precise synchro- 
nization, a sufficiently accurate global time of measurement for the different sensors 
is required to be defined or derived using synchronization techniques [309]. In this 
context, a multi-sensor data fusion system has to cope with different and varying
measurement frequencies, measurement latencies, and asynchronous measurement 
times. 

Synchronization techniques can be divided into two main families: deterministic 
and non-deterministic. In a deterministic setting, the measurement times of each sensor 
have to be known in advance and synchronization is performed on the slowest sensor 
using aggregation techniques [309]. In a non-deterministic setting, sensors are assumed 
to be asynchronous and there is no knowledge about measurement times or latencies. 
In the process recordings, different frequencies might be used. In such situations, 
recursive filtering approaches such as the Kalman filter or recursive autoregressive 
filters can be used for synchronization [291]. 
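
For the deterministic setting, synchronization on the slowest sensor can be sketched as follows. The example assumes a hypothetical function `align_to_slow_sensor`, known timestamps for both sensors, and the mean as the aggregation function. Other aggregates (last value, median) would fit the same pattern.

```python
from statistics import mean
from typing import List, Optional, Sequence, Tuple


def align_to_slow_sensor(
    slow_times: Sequence[float],
    fast_samples: Sequence[Tuple[float, float]],
) -> List[Optional[float]]:
    """Aggregate a fast sensor onto the timestamps of the slowest sensor.

    For every slow timestamp t_k, all fast samples with
    t_(k-1) < timestamp <= t_k are averaged. Intervals without any
    fast sample yield None.
    """
    aligned: List[Optional[float]] = []
    previous = float("-inf")
    for t in slow_times:
        bucket = [value for (ts, value) in fast_samples if previous < ts <= t]
        aligned.append(mean(bucket) if bucket else None)
        previous = t
    return aligned


fast = [(2, 1.0), (6, 3.0), (10, 5.0), (14, 7.0), (18, 9.0), (25, 11.0)]
aligned = align_to_slow_sensor([10, 20], fast)
```

The sample at time 25 is not used because it lies after the last slow timestamp. In a non-deterministic setting this simple bucketing is not applicable, which is where the recursive filters mentioned above come in.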

In addition, different hypotheses for drawing data samples from sensors may also 
lead to a mismatch in the sense of learning from different underlying distributions. 
Therefore, sensor data has to be calibrated (e.g. adjustment of sensor parameters, 
features, raw data cuts for out-ranging measurements using classification or clustering 
techniques) [654]. 


3.2.3.3 Data Representation 

As it will be shown in the case of steel production (Section 3.2.4.2), a single run through 
the process chain can be represented by a set of time series with different lengths and 
offsets, which may overlap in time, contain different numbers of segments at different 
levels of granularity, stem from different machines and sensors, and may also be entirely
missing for optional processing steps. 

Many common data analysis methods cannot work directly on such sets of time series;
rather, they require all observations to be represented by fixed-length feature
vectors. Here, we discuss how the raw series values can be transformed to fixed-length
vectors and concatenated into one single representation.


Mapping of Value Series to a Fixed-Length Vector and Concatenation The goal 
is to transform raw sensor data into fixed-length vectors in order to make standard 
learning techniques applicable. We start by reserving enough space for the records of 
each sensor in a single fixed-length numerical vector. Original series values are then 
indexed by predefined positions in this vector. For the mapping, the maximum length of 
the time series needs to be known beforehand. For the projection, the original time series 
values might need to be rescaled, e.g., by interpolation. As a result, the application of 
the most popular distance-based data analysis algorithms becomes possible.
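
The mapping described above can be sketched in a few lines. The sketch assumes hypothetical helper names (`resample`, `to_fixed_length_vector`), the same slot length for every sensor, linear interpolation for rescaling, and a constant fill value for sensors without data. The choice of fill value is exactly the difficult question discussed in the next paragraph.

```python
from typing import Dict, List, Sequence


def resample(series: Sequence[float], length: int) -> List[float]:
    """Rescale a series to `length` points by linear interpolation."""
    if not series:
        raise ValueError("cannot resample an empty series")
    if len(series) == 1 or length == 1:
        return [float(series[0])] * length
    scale = (len(series) - 1) / (length - 1)
    out = []
    for j in range(length):
        pos = j * scale                      # position in the original series
        left = int(pos)
        right = min(left + 1, len(series) - 1)
        frac = pos - left
        out.append(series[left] * (1 - frac) + series[right] * frac)
    return out


def to_fixed_length_vector(series_by_sensor: Dict[str, Sequence[float]],
                           sensor_order: Sequence[str],
                           slot_length: int,
                           fill: float = 0.0) -> List[float]:
    """Concatenate one fixed-size slot per sensor into a single vector."""
    vector: List[float] = []
    for sensor in sensor_order:
        series = series_by_sensor.get(sensor)
        if series:
            vector.extend(resample(series, slot_length))
        else:
            vector.extend([fill] * slot_length)  # no processing took place
    return vector


vec = to_fixed_length_vector({"s1": [0.0, 10.0]}, ["s1", "s2"], slot_length=3)
```

Because every sensor always occupies the same positions, vectors of different workpieces become directly comparable, at the cost of distorting the original time axis.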

A difficult question is which values to assign to portions where no processing took 
place. We may simply fill missing portions with zeros or the last recorded value. However, 
filling with zero values can lead to inaccurate evaluation with several popular distance 
measures, including Euclidean distance. For example, both series would be marked 
as highly dissimilar by Euclidean distance, although both blocks (A and B) can have 
a similar quality. In such a case, the correct mapping between similar feature vectors 
and similar labels would be altered. Reserving the same portion for both finishing rolls 
(sensors 5 and 6) in the fixed-length vector seems to solve the problem, but it does not 
take into account that both finishing rolls might have different properties, e.g., value 
scales, which usually require a careful data calibration [654]. Instead of transforming all 
series values to a fixed-length vector, another option would be to use distance measures 
that can handle value series with different lengths, such as the Dynamic Time Warping 
(DTW) [461] or the Longest Common Subsequence (LCSS) distance [160]. In principle,
there are two main approaches for transforming the original time series appropriately.
The first approach simply concatenates all time series belonging to the processing of 
single manufactured pieces (e.g. a single steel block). The second approach consists of 
calculating distance values for each time series of each sensor independently and then 
summing them up to a total distance. 
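
The second approach can be sketched with a textbook dynamic-programming formulation of DTW. The function names `dtw_distance` and `total_distance` are hypothetical, the local cost is the absolute difference, no warping window is used, and sensors that are missing for one of the two pieces are simply ignored. All of these are assumptions made for brevity.

```python
from typing import Dict, Sequence


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Dynamic Time Warping distance between two series of any length."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j]: minimal accumulated cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # a advances
                                     cost[i][j - 1],      # b advances
                                     cost[i - 1][j - 1])  # both advance
    return cost[n][m]


def total_distance(piece_a: Dict[str, Sequence[float]],
                   piece_b: Dict[str, Sequence[float]]) -> float:
    """Sum the per-sensor DTW distances over sensors present in both pieces."""
    shared = sorted(set(piece_a) & set(piece_b))
    return sum(dtw_distance(piece_a[s], piece_b[s]) for s in shared)


block_a = {"sensor5": [1.0, 2.0, 3.0], "sensor6": [0.0, 0.0]}
block_b = {"sensor5": [1.0, 2.0, 2.0, 3.0], "sensor6": [1.0, 1.0]}
```

The quadratic table makes this formulation expensive for long series, which matters under resource constraints. Summing raw per-sensor distances also presupposes comparable value scales, so a calibration or normalization step would normally come first.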


3.2.3.4 Feature Extraction and Selection 

Instead of using a raw data stream, we can characterize production processes by a 
devised set of features extracted from raw data. The transformation of the raw data into 
a feature vector should maximize the prediction accuracy and increase the resource
awareness of the machine learning algorithms by reducing the dimensionality and the 
amount of stored data. (See Chapter 1 in Volume 1.) The challenge in this task consists 
of the huge search space (i.e. exponential size) over all the possible transformations. 
Traditional approaches are based on manual feature extraction to build features one 
at a time using data analysis and domain knowledge. While these approaches are 
accessible and interpretable to users, they are tedious, time-consuming, error-prone,
and usually not adaptive to data changes. In this context, automatic feature engineering
appears to be a promising way to go. It consists of automatically extracting and
generating a large number of features and selecting an effective subset of these features
to ensure better performance of the machine learning algorithm. Many works in the
literature address automated feature engineering, either by automating the
whole process [594, 445] or by focusing on particular steps such as feature extraction,
feature generation [314], or feature selection [582].

The following sections present a non-exhaustive list of transformation and fea- 
ture extraction methods that look especially promising in the context of production 
processes. 


Aggregation and Summarization Aggregation and summarization methods for 
streaming time series data reduce the amount of collected data as much as possible 
and try to retain its most important patterns. The simplest type of aggregation is based 
on the calculation of summary statistics such as minimum and maximum values, the 
mean, median, standard deviation, percentiles, and histograms. Such simple features 
can already encode sufficient information for the learner and sometimes outperform 
sophisticated methods [654]. More sophisticated methods search for other representations 
of the time series based on time series transformations. These include the Discrete 
Fourier Transform (DFT) [445] and the Discrete Wavelet Transformation (DWT) [445]. 
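
As a minimal sketch of the simplest aggregation type, the following Python function maps a value series of arbitrary length to a fixed set of summary statistics. The function name `summarize`, the equal-width histogram, and the default of four bins are assumptions made for illustration only.

```python
import statistics


def summarize(series, bins=4):
    """Map a value series of arbitrary length to fixed-size summary features."""
    lo, hi = min(series), max(series)
    width = (hi - lo) / bins or 1.0  # constant series: avoid a zero bin width
    hist = [0] * bins
    for v in series:
        idx = min(int((v - lo) / width), bins - 1)  # maximum falls into the last bin
        hist[idx] += 1
    return {
        "min": lo,
        "max": hi,
        "mean": statistics.fmean(series),
        "median": statistics.median(series),
        "std": statistics.pstdev(series),
        "hist": hist,
    }


features = summarize([1, 2, 3, 4, 5, 6, 7, 8])
print(features["mean"], features["hist"])
```

Whatever the length of the input series, the output has the same number of entries, which is what makes such features usable by standard learners.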


Segmentation The salient features approach by Candan, Rossini, Sapino, and Wang 
[115] transfers ideas from the segmentation of two-dimensional images and the extraction 
of Scale-Invariant Feature Transform (SIFT) features from images to the 
space of one-dimensional value series. Salient points in the series, which are points 
that deviate much from their surrounding values, are used for segmenting the series. 
Then, from each segment, characterizing features are extracted. The method deter- 
mines salient points at different resolutions, allowing for a description of value series 
at different levels of granularity. One segmentation approach was developed in the 
context of the steel production use case (Section 3.2.4.2), which determines segments 
based on domain knowledge and signals from machines in the process chain [654]. 
After segmentation, different statistics are computed on the segments such as the mean, 
the standard deviation, and the minimum and maximum values. Other features are 
differences between values and histograms. The biggest advantage is that the approach 
allows for the combination of many features in a highly interpretable manner [654], 
since the feature transformation is handled in a multivariate manner in the sense that 
features from different value series and their parts are combined together. For example, 
a classification rule that is formed based on such features may be: “Predict a steel block 
as defective if it was heated for less than one hour at 900 degrees and if the maximum 
rolling force in the first rolling step exceeds the value of 10 000”. The approach has 
already been used successfully for the identification of coarse-grained patterns such as 
processing modes (see Section 3.2.4.2). 


Symbolic Representation SAX (Symbolic Aggregate Approximation) [413] first reduces 
a sequence $C = (c_1, \ldots, c_n)$ by piecewise aggregate approximation and maps 
it to a new sequence $\bar{C} = (\bar{c}_1, \ldots, \bar{c}_w)$ with $w < n$: 

$$\bar{c}_i = \frac{w}{n} \sum_{j = \frac{n}{w}(i-1)+1}^{\frac{n}{w} i} c_j \qquad (3.1)$$

The elements $\bar{c}_i$ are then discretized by mapping them to a fixed number of symbols 
in a way that preserves a bound on the Euclidean distance between the series. A gradient-based 
approach for the symbolization of streaming sensor data was introduced by Morik and 
Wessel [458]. This approach was originally applied in the context of text mining and 
has been successfully used in areas such as text classification or intrusion detection. 
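
The two SAX steps, piecewise aggregation followed by discretization, can be illustrated as follows. The sketch makes several assumptions that are common in SAX implementations but not spelled out above: the series is z-normalized first, $w$ divides $n$, and the symbol boundaries are the equiprobable quantiles of a standard normal distribution.

```python
import statistics


def paa(series, w):
    """Piecewise aggregate approximation; assumes len(series) is divisible by w."""
    n = len(series)
    if n % w != 0:
        raise ValueError("this sketch assumes that w divides n")
    step = n // w
    return [sum(series[i * step:(i + 1) * step]) / step for i in range(w)]


def sax(series, w, alphabet="abcd"):
    """Z-normalize, reduce with PAA, then map each segment mean to a symbol."""
    mu = statistics.fmean(series)
    sigma = statistics.pstdev(series) or 1.0  # constant series: avoid division by zero
    normalized = [(v - mu) / sigma for v in series]
    # Breakpoints split the standard normal distribution into equiprobable regions.
    k = len(alphabet)
    breakpoints = [statistics.NormalDist().inv_cdf(i / k) for i in range(1, k)]
    word = ""
    for segment_mean in paa(normalized, w):
        idx = sum(segment_mean > b for b in breakpoints)  # breakpoints below the mean
        word += alphabet[idx]
    return word


print(sax([1, 2, 3, 4, 5, 6, 7, 8], w=4))
```

A steadily increasing series of eight values is thus compressed into a four-letter word, one symbol per segment.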


Method Trees All of the aforementioned methods, with different parameter settings 
and combinations, are useful for extracting functional features from value series. In- 
stead of trying and combining all the existing methods and different parameter values 
manually, Mierswa and Morik [445] developed an automatic representation learning 
method that optimizes the composition of a representation for best classification learn- 
ing. Basis transformations, filters, mark-ups, and a generalized windowing present 
elementary methods that are combined in the form of a method tree. The tree applies 
the operators (nodes) in a breadth-first manner to transform the original value series. 
The root of each tree consists of a windowing function, while the children of each parent 
node are operator chains representing basis transformations, filters, and a finishing 
function. Learning the feature extraction tree is performed by a genetic programming 
algorithm. The method can be used for the analysis of time series from production 
processes. However, its demand for stratified datasets with respect to the labels is 
not always met by the production data. The method has been implemented within 
RapidMiner. 


3.2.3.5 Modeling 
The following sections give a short overview of widely used methods that are assumed 
to be of special relevance for the embedded data analysis in production processes. 


3.2.3.6 Supervised Learning 

The most popular task in the context of Supervised Learning is inferring functions 
representing the relationship between a set of explanatory variables (i.e. features) and 
a response target variable which can be discrete (i.e. classification) or continuous (i.e. 
regression), also known as function learning. Let $X$ be a set of possible explanatory 
variables, $Y$ be a set of possible target values and $D$ an unknown probability distribution 
on $X \times Y$. Let further $H$ be a set of possible functions. Given examples $(x, y) \in X \times Y$, 
drawn from $D$, where $y = f(x)$ for an unknown function $f$, the goal is to find a function 
$h \in H$, $h : X \to Y$, such that the error $\mathrm{err}_D(h, f)$ is minimized. 

In the case of a production process, the explanatory variables are features extracted from 
sensor measurements (i.e. records for a given steel block) and $y \in Y$ is the corresponding 
quality label to be predicted. Examples of learning functions $h \in H$ include Decision 
Trees, K-Nearest Neighbour (kNN), Naive Bayes, and Support Vector Machines (SVM). 
For a detailed overview of such methods, see Hastie et al. [260]. 

The aforementioned methods assume that all training data is available in batches. 
Hence, they cannot be trained during the manufacturing process, and their models 
need to be retrained if concepts change. In the absence of concept drift, 
models can be trained offline but deployed online for the prediction. 


3.2.3.7 Unsupervised Learning 

If no labels are provided, unsupervised learning methods may be employed to reveal the 
most prominent patterns in the data. Cluster analysis [300] tries to group observations, 
following a similarity measure. The number of clusters k is usually a user-defined 
hyper-parameter. A well-known clustering algorithm is k-means [426]. Dimensionality 
reduction techniques such as principal component analysis (PCA) aim at simplifying 
high-dimensional datasets. Some of them may be used for better data visualization, 
like Self-Organizing Maps (SOMs) [654], which map high-dimensional input vectors to a low-dimensional grid. 
Vectors that are similar to each other in the input space lie close to each other on the 
grid. SOMs have also been used for analyzing the data in the steel production case 
study (see Section 3.2.4.2). The biggest disadvantage of unsupervised methods is that, 
without any labeled data, their results can only be validated by domain experts. 


3.2.3.8 Learning in the Presence of Class Imbalance 

Class imbalance occurs when data classes are not equally frequent. Generally, it occurs 
when some classes represent rare events, while the other classes represent the coun- 
terpart of these events. Rare events, especially those that may have a negative impact, 
often require informed and prompt decision-making. However, the class imbalance 
is known to induce a learning bias towards majority classes, which implies a poor 
detection of minority classes. For example, production processes with high-quality 
standards usually output more high-quality goods. Similarly, certain events, like ma- 
chine or sensor failures, may only occur rarely. In such cases, many positive examples 
but only a few or even no examples of the negative class are available. Evaluating 
classifiers on imbalanced data by accuracy alone is problematic, because this metric is 
dominated by the majority class. Class imbalance can 
be mitigated using different methods including one-class learning, class rebalancing 
methods, and ensemble learning. 
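
A small worked example makes the bias of accuracy concrete. The class ratio of 990 to 10 and the trivial classifier that always predicts the majority class are invented for illustration; balanced accuracy (the mean of the per-class recalls) is shown as one common alternative, not as the measure used in this chapter.

```python
def accuracy(y_true, y_pred):
    """Fraction of correctly classified examples."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, cls):
    """Fraction of the examples of class `cls` that are predicted as `cls`."""
    predictions = [p for t, p in zip(y_true, y_pred) if t == cls]
    return sum(p == cls for p in predictions) / len(predictions)


def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recalls."""
    classes = sorted(set(y_true))
    return sum(recall(y_true, y_pred, c) for c in classes) / len(classes)


# Hypothetical data: 990 good parts, 10 defective parts.
y_true = ["Ok"] * 990 + ["NOk"] * 10
y_pred = ["Ok"] * 1000  # trivial classifier: always predict the majority class

print(accuracy(y_true, y_pred))           # high, although no defect is found
print(recall(y_true, y_pred, "NOk"))      # recall on the minority class
print(balanced_accuracy(y_true, y_pred))  # exposes the failure
```

The trivial classifier reaches 99 % accuracy while detecting none of the defective parts, which is why class-wise measures are reported in the case study below.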
One Class Learning One way of handling class imbalance is to treat minority class 
instances as outliers or anomalies and majority class instances as the normal class 
using the task of one class learning [460]. Tax and Duin [675] propose a Support Vector 
Data Description (SVDD) that computes a spherical boundary around the given data 
points. The diameter of the enclosing ball and thereby the volume of the training data 
falling within the ball can be chosen by the user. Observations inside the ball are then 
classified as normal whereas those outside the ball are treated as outliers or anomalies. 
Schölkopf et al. [581] have proposed the 1-Class SVM, which separates all training 
examples with a maximum margin from the origin. An active approach to such data 
domain descriptions generalizes this approach [238]. 
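
The geometric idea of a spherical data description can be illustrated with a deliberately simplified stand-in. The sketch below is not SVDD: it places the center at the mean of the training points and chooses the radius as a user-defined quantile of the training distances, whereas SVDD obtains center and radius from a (kernelized) optimization problem. The function names and the `keep` parameter are hypothetical.

```python
import math


def fit_ball(points, keep=0.95):
    """Centroid-based ball that encloses roughly a fraction `keep` of the points."""
    dim = len(points[0])
    center = [sum(p[d] for p in points) / len(points) for d in range(dim)]
    dists = sorted(math.dist(p, center) for p in points)
    # Radius = distance of the training point at the user-chosen quantile.
    idx = min(len(dists) - 1, max(0, math.ceil(keep * len(dists)) - 1))
    return center, dists[idx]


def is_outlier(x, center, radius):
    """Observations outside the ball are treated as outliers."""
    return math.dist(x, center) > radius


normal_points = [(0, 0), (1, 0), (0, 1), (1, 1)]
center, radius = fit_ball(normal_points, keep=1.0)
print(is_outlier((0.5, 0.5), center, radius), is_outlier((3, 3), center, radius))
```

Lowering `keep` shrinks the ball, which mirrors the user-controlled trade-off between enclosed training volume and sensitivity described above.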


Class Rebalancing Methods Several methods have been proposed to handle the 
class imbalance problem using resampling of the data [563]. Resampling strategies 
rebalance the data in order to mitigate the effect of the bias of machine learning models 
towards the majority class [125]. Resampling methods are considered to be flexible 
and are widely used since they are independent of the selected classifier. The class 
imbalance problem can also be solved using algorithmic modifications of existing 
machine learning classifiers, e.g., support vector machines, k-nearest neighbors, or 
neural networks. Modifications can be introduced by enhancing the discriminatory 
power of the classifiers towards the minority class using kernel transformation to 
increase the separability of the original training space [219]. 

Cost-sensitive learning can also be applied in the context of class imbalance by modifying 
the loss functions to increase the misclassification costs of the minority samples 
[322]. 
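
Because resampling is independent of the classifier, it can be sketched without reference to any learning library. The following minimal example implements random oversampling only; the function name and the fixed seed are assumptions, and in practice the resampling would be applied to the training folds only, never to validation or test data.

```python
import random


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen minority examples until all classes are equally frequent."""
    rng = random.Random(seed)
    by_class = {}
    for xi, yi in zip(X, y):
        by_class.setdefault(yi, []).append(xi)
    target = max(len(rows) for rows in by_class.values())
    X_out, y_out = [], []
    for label, rows in by_class.items():
        # Draw with replacement from the existing rows of this class.
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        for row in rows + extra:
            X_out.append(row)
            y_out.append(label)
    return X_out, y_out


X = [[i] for i in range(10)]
y = ["Ok"] * 8 + ["NOk"] * 2
X_bal, y_bal = random_oversample(X, y)
print(y_bal.count("Ok"), y_bal.count("NOk"))
```

Random undersampling follows the same pattern with `min` instead of `max` and sampling without replacement, while SMOTE additionally interpolates between neighboring minority examples.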


Learning Ensembles in the Presence of Class Imbalance Combining classifiers 
in ensemble frameworks is another common approach to handle the class imbalance 
problem [563]. Within ensemble-based classifiers, we can distinguish four main families. 
The first family includes resampling based ensembles. An ensemble of classifiers is 
created after training base classifiers on balanced datasets obtained with a resampling 
technique. In the second family, the ensemble is built based on boosting [576] after 
applying a data resampling strategy (e.g. SMOTEBoost [126]). Within the third family, we 
find Bagging-based ensembles [101] (e.g. UnderBagging, OverBagging, SMOTEBagging 
[715]). 

Recently, we proposed a probabilistic ensemble method to handle the class im- 
balance explicitly at training time [563]. Unlike existing ensemble methods for class 
imbalance, which use either data-driven or randomized approaches for their construc- 
tion, our method leverages both directions. On the one hand, ensemble members are 
constructed from randomized subsets of training data. On the other hand, we design 
different scenarios of class imbalance for the unknown test data. For each of the resulting 
scenarios, an ensemble is obtained by combining random sampling with an 
estimate of the relative importance of specific loss functions. The final predictions are 
generated by computing a weighted average over the individual ensemble predictions. 
In contrast to existing methods, this approach does not attempt to correct imbalanced 
datasets. Instead, it has been shown how imbalanced data sets can facilitate classifica- 
tion, given a limited range of true class frequencies. This method promotes diversity 
among ensemble members and is insensitive to certain parameter settings. 


3.2.3.9 Stream Mining 

To cope with real-time evolving production systems, every step of the analysis should be 
applied and handled in an online fashion. Whereas learning an effective representation 
needs to be run offline, as shown above with method trees, the learned extraction 
must be applied online. Many applications demand the transformation of the raw 
sensor data, the deployment of the model, and sometimes even the model update and 
the detection of novel events being performed online. Change detection is usually done 
by updating the model with every new arriving data item or learning a model on a batch 
of the most recent data items [218]. 

The biggest challenge in applying classical stream mining settings to production 
processes is that the labels are usually measured at the end of the process chain. 
For the time series starting at $t_{i0}$ and ending at $t_{ie}$, we have the observations 
$x_{t_{i0}}, \ldots, x_{t_{ia}}, \ldots, x_{t_{ie}}$, but only the final quality measurement 
$y_{t_{ie}}$. Hence, at each observation point we predict the final quality. The model $f$ 
uses $p$ extracted features from $x \in X$, i.e. $f(\varphi_1(x), \ldots, \varphi_p(x)) = y$. 
Let $t_{i0}$ be the starting time of the process $p_i$, $t_{ie}$ its end, and $t_{ia}$ the current time 
instant. One possible approach is to segment the time series and learn an individual 
model on every single segment to predict the final quality. This approach is useful when 
important events cannot be identified in time and memory constraints are imposed, i.e., 
training models on time series subsequences instead of considering all the historical 
data. If it is possible to identify important events within the corresponding step of the 
process occurring at a given time instant $t_{ic}$, then every feature extracted from the 
segment $[t_{i0}, t_{ic}]$ can be used to learn a model. It is therefore possible to generate the 
first prediction of the label if $t_{ia} > t_{ic}$. 

A second approach consists of using a combination of static and statistical features, 
like the minimum, maximum, or average of the time series. The static data do not change 
over the complete process, and the probability of change of the statistical features 
decreases towards the end of the process. This means that there exists an index $t_{is}$ at which the 
prediction error is bounded by $(f(\varphi_1([t_{i0}, t_{is}]), \ldots, \varphi_p([t_{i0}, t_{is}])) - y)^2 < \epsilon$. Another 
approach would be to use one or a set of algorithms that predict the set of features for 
the unseen part of the time series $[t_{ia}, t_{ie}]$ and train a model on the full feature set to 
predict the quality label. The overall prediction will therefore be strongly dependent 
on the accuracy level of the feature prediction. 
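
The statistical features of the second approach can be maintained online with constant memory, as in the following sketch. The class `RunningStats` and the stopping heuristic `first_stable_index` are hypothetical illustrations: the heuristic only checks when the running mean stops moving and is a rough proxy for the index $t_{is}$, not a guarantee on the prediction error.

```python
class RunningStats:
    """Constant-memory running minimum, maximum, and mean of a stream."""

    def __init__(self):
        self.n = 0
        self.min = float("inf")
        self.max = float("-inf")
        self.mean = 0.0

    def update(self, x):
        """Consume one observation and return the current (min, max, mean)."""
        self.n += 1
        self.min = min(self.min, x)
        self.max = max(self.max, x)
        self.mean += (x - self.mean) / self.n  # incremental mean update
        return self.min, self.max, self.mean


def first_stable_index(stream, tol=0.05):
    """Index of the first observation that moves the running mean by less than tol."""
    stats = RunningStats()
    previous = None
    for i, x in enumerate(stream):
        _, _, mean = stats.update(x)
        if previous is not None and abs(mean - previous) < tol:
            return i
        previous = mean
    return None  # the mean never stabilized within the stream


print(first_stable_index([10, 12, 11, 11, 11, 11]))
```

From such an index onwards, the features computed so far could be passed to the quality model instead of waiting for the end of the process.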




3.2.3.10 Online Management of Many Models 

As mentioned above, machine learning models can be trained offline 
and deployed in real time to generate quality predictions. It may also happen that the 
stream-mining setting requires the training of many models on different parts of the 
data, e.g., different segments of the time series (see Section 3.2.3.9). In addition, the 
increasing individualization of products and processes may result in small heteroge- 
neous groups of observations. Learning distinct models is then necessary to represent 
each group of observations and capture its underlying properties. Therefore, efficient 
online management of many models should be established through the dynamic combi- 
nation of many models built to comply with detected changes in the data by adaptively 
changing their combination and integration rules in real time. The dynamic combina- 
tion can formally be established with adaptive ensemble methods that are built as a 
weighted combination of distributions characterizing the target concepts and enabling 
flexible management of the models from the individual model selection to the weighted 
aggregation of many models [561, 562, 564, 565]. Adaptive ensemble 
methods are applied in connection with the detection of concept changes in the data, which 
enables the ensemble to be updated on different levels, i.e. base model selection [561] and informed 
adaptation of base model or ensemble parameters after the detection of a concept drift [562]. 

In this context, we have proposed an adaptive ensemble selection framework that 
manages online two main ensemble construction stages: pruning and integration [564]. 
Since the performance of ensemble-based models changes over time, it is also consid- 
ered to be subject to concept drifts. A drift detection mechanism is employed to exclude 
models whose performance becomes significantly worse compared with the remaining 
models and to identify the top base models in terms of performance. Performance is 
evaluated in this context using a custom measure based on the Pearson correlation 
(commonly used to deal with time series data) between the base models' forecasts and the 
target time series on a sliding-window validation set. After each drift detection, top 
base models are identified. Since diversity is a fundamental component in ensemble 
methods, we perform a second-stage selection through clustering model outputs. Clusters 
and top base models are updated after each drift detection (i.e. whenever an alarm 
is triggered by our drift detection). At each cluster computation, the models that belong 
to the cluster representatives are selected. In a final step, the selected models’ outputs 
are combined together using a sliding-window weighted average. 
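
A strongly reduced sketch of the pruning and integration stages is given below. It keeps only two ingredients: ranking base models by the Pearson correlation on a sliding window and combining the top models with a weighted average. Drift detection and output clustering are omitted, the weighting by non-negative correlation is an assumption, and all names are hypothetical rather than taken from [564].

```python
import math


def pearson(a, b):
    """Pearson correlation of two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # constant sequence: treat as uncorrelated
    return cov / math.sqrt(var_a * var_b)


def select_and_combine(forecasts, target, next_forecasts, window=5, top_k=2):
    """Rank models on the last `window` points and average the top_k next forecasts.

    forecasts: model name -> past forecasts aligned with `target`
    next_forecasts: model name -> forecast for the next time step
    """
    scores = {m: pearson(f[-window:], target[-window:]) for m, f in forecasts.items()}
    top = sorted(scores, key=scores.get, reverse=True)[:top_k]
    weights = {m: max(scores[m], 0.0) for m in top}
    total = sum(weights.values())
    if total == 0:  # no positively correlated model: fall back to the plain mean
        return top, sum(next_forecasts[m] for m in top) / len(top)
    return top, sum(weights[m] * next_forecasts[m] for m in top) / total


target = [1, 2, 3, 4, 5]
past = {
    "good": [1.1, 2.0, 3.2, 3.9, 5.1],
    "shifted": [2, 3, 4, 5, 6],
    "bad": [5, 4, 3, 2, 1],
}
upcoming = {"good": 6.0, "shifted": 7.0, "bad": 0.0}
print(select_and_combine(past, target, upcoming))
```

Note that a correlation-based score ignores constant offsets, so the "shifted" model ranks highest here; this is one reason why a custom measure rather than the plain correlation may be preferable.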

We have also developed a framework for online ensemble aggregation using deep 
reinforcement learning for time series forecasting [562, 565]. There, we leverage a 
deep reinforcement learning framework for learning linearly weighted ensembles as 
a meta-learning method. In this framework, the combination policy in ensembles is 
modeled as a sequential decision-making process and an actor-critic model, that aims 
at learning the optimal weights in a continuous action space, is used. The policy is 
updated following a drift detection mechanism for tracking performance shifts of the 
ensemble model over time [562]. 




These frameworks were developed initially to manage forecasting models and can be 
applied to predicting continuous quality-related measures over time, but can also be 
transferred to classification models operating in streaming environments. 


3.2.4 Case Studies 


The following case studies highlight different facets of the introduced theoretical framework 
and give a concrete impression of how the described methods can be applied in 
practice in order to solve quality-related problems. However, not all concepts have 
been validated in the particular industrial environment. In the SMT manufacturing 
use case, the steps of data acquisition and storage, learning in the presence of class 
imbalance as well as feature extraction and modeling are explained. Lastly, the process 
optimization in the SMT line is described. In the hot rolling mill process, the data 
acquisition and storage are highlighted, since the processing of the sensory data was 
relatively challenging due to incomplete sensory measurements and other factors. Also, 
the feature extraction and modeling phase are explained. 


3.2.4.1 Model-Based Quality Prediction in Electronics Manufacturing 
The case study of electronics production covers the production of programmable logic 
controllers of the Simatic type. At the end of the soldering process, the correct position 
of soldered components is checked. This is accomplished by an Automatic Optical 
Inspection (AOI) for variants with visible connection points (pins) and by an X-ray 
inspection for variants with the pins underneath the components from two different 
perspectives ($X_1$ and $X_2$). The Printed Circuit Boards (PCB) are placed on a panel and 
are tested in a pool test by placing 48 PCBs in eight Fields Of View (FOV). The number 
of pins on each PCB is 79 and 52 for the $X_1$ and $X_2$ directions, respectively. A typical 
PCB is illustrated in Figure 3.6 to showcase one typical product variant that is 
manufactured using the SMT technology. Due to the long inspection time as well as 
the large number of units produced, X-ray inspection is a bottleneck in the production 
of PCBs, especially when considering the 100 % inspection, i.e. the testing of every 
PCB, that is conducted and the constantly growing demand for programmable logic 
controllers [579]. 

For this case study, the focus has been narrowed to one product variant, its respec- 
tive manufacturing line, and the data sources of SPI and X-ray inspection. 


Data Acquisition and Storage Historic datasets from Solder Paste Inspection (SPI) 
and X-ray inspection are matched across different manufacturing databases via a unique identifier. 
Resolving the use case at the pin level, where each pin is assigned a quality label, is a 
low-dimensional machine learning task, as was shown by [580], since only seven different 
features are mapped to one categorical class label. However, from a process perspective, 
the quality test is not feasible on a pin level and the process data is aggregated to the 
FOV level. The dimensionality of the process data thereby increases dramatically. The 
considered dataset consists of numeric SPI features and a binary X-ray 
label on the aggregation level of FOVs, which is formalized as a binary classification 
task. The features characterizing each pin are summarized in Table 3.1. 


Fig. 3.6: The image illustrates a PCB from the $X_1$ direction that is manufactured using SMT. 


Tab. 3.1: Descriptive PCB features on the pin level. 

SPI feature    Unit 
Height         % 
Shape 2D       % 
Shape 3D       % 
Surface        % 
Volume         % 
Offset X       µm 
Offset Y       µm 

Historical datasets for a period of five production months are used for the case study. 
In total, 1 461 037 321 data points are parsed, of which 800 parts per million (ppm) are 
NOk. 

Non-representative datasets, which, for example, are recorded under obsolete 
process configurations or during manufacturing trials, are eliminated using expert 
knowledge. The result is a prepared and cleaned training dataset for subsequent mod- 
eling, including a unique identifier, all relevant features and the quality label, which 
can be continuous or discrete depending on the applied measurement method. 


Learning in the Presence of Class Imbalance The considered use case constitutes 
an extreme case of class imbalance. When using no specific countermeasures, the 
classifier degenerates due to frequent misclassifications of the minority class. Classifying 
a defective part as Ok is often unacceptable in an industrial environment. Therefore, 
specific measures are taken in order to increase the performance of the algorithm 
on the NOk class. To rebalance the dataset, a combination of different resampling 
techniques is examined, such as oversampling (SMOTE, random oversampling) and 
undersampling (random undersampling). Furthermore, the threshold on the output 
class-membership probability is systematically tuned in order to achieve a better trade-off 
between the precision and recall of the classifier. 
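
The threshold tuning can be pictured as a simple search over candidate thresholds. The sketch below is an assumed selection rule (highest precision among the thresholds that reach a required NOk recall); the scores, labels, and candidate values are invented and do not stem from the case study.

```python
def precision_recall(y_true, scores, threshold, positive="NOk"):
    """Precision and recall of the positive class when predicting it for score >= threshold."""
    tp = fp = fn = 0
    for label, score in zip(y_true, scores):
        predicted_positive = score >= threshold
        if predicted_positive and label == positive:
            tp += 1
        elif predicted_positive:
            fp += 1
        elif label == positive:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def best_threshold(y_true, scores, min_recall, candidates):
    """Return (threshold, precision, recall) with the best precision at >= min_recall."""
    best = None
    for th in candidates:
        p, r = precision_recall(y_true, scores, th)
        if r >= min_recall and (best is None or p > best[1]):
            best = (th, p, r)
    return best  # None if no candidate reaches the required recall


# Hypothetical predicted probabilities of the NOk class.
labels = ["NOk", "NOk", "Ok", "Ok", "Ok", "Ok"]
probs = [0.9, 0.4, 0.6, 0.3, 0.2, 0.1]
print(best_threshold(labels, probs, min_recall=1.0, candidates=[0.3, 0.5, 0.7]))
```

Requiring full recall on the NOk class forces a low threshold and thus a lower precision, which is exactly the trade-off that has to be negotiated with the process owners.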


Feature Extraction and Model Learning Initially, the PCB process data was described 
on the pin level. Process expertise led to the conclusion that aggregation to the 
FOV level is necessary, so that each FOV consists of either 79 pins in the $X_1$ direction 
or 52 pins in the $X_2$ direction, with seven features for each pin. Since the results of [580] on the 
FOV level were not promising, further experiments are conducted. One direction is to 
reduce the dimensionality using automatic feature extraction methods in combination 
with machine learning models. For this we use the RapidMiner extension Value Series 
Plugin. This reduces the dimensionality of the process data from 553 features to 
240 features for $X_1$ and from 364 features to 50 features for $X_2$. 


Model Evaluation The training of the models takes place in a nested structure of inner 
and outer cross-validation and hyper-parameter optimization. As the a priori selection of 
adequate algorithms is not achievable in a generalized way, different learning methods 
and algorithms are tested and evaluated for each individual application [579], including 
Gradient Boosting Trees (GBT), Random Forest (RF), and Multilayer Perceptron (MLP), 
which is based on fully connected neurons that compute a complete weighted sum 
in the affine transformation stage of the connections. These methods are used as a 
baseline for the experiments. In contrast to a fully connected neuron, a 1-Dimensional 
Convolutional Neural Network (1D-CNN) is based on the principle of weight sharing on 
all connection units. For example, in the discrete one-dimensional (1D) case, 
$(d * K)(x) = \sum_a d(x - a) K(a)$ is a convolution of a data sample $d$ with the kernel (or filter) $K$ [767]. 
A 1D-CNN with a kernel size of k = 3 is used in order to integrate knowledge from the 
direct neighborhood of the pins in question into the learning process. Two convolutions 
are stacked and, additionally, dropout with a rate of 0.5 is used for regularization. 
The most discriminative features are obtained by combining max pooling and two 
consecutive dense layers. The best performing rebalancing technique in this use case 
is random oversampling, which is applied before feeding the data to the classifiers. The 
summarized results can be found in Table 3.2. 
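
The convolution formula and the notion of weight sharing can be made concrete with a few lines of plain Python. This is only an illustration of the operation itself for a kernel of size 3, not the network used in the experiments; it computes the "valid" part of the convolution, i.e. only positions where the kernel fully overlaps the data.

```python
def conv1d(d, K):
    """Discrete 'valid' convolution: out[x] = sum over a of d[x - a] * K[a]."""
    k = len(K)
    # The same kernel weights K are reused at every position x (weight sharing).
    return [sum(d[x - a] * K[a] for a in range(k)) for x in range(k - 1, len(d))]


def global_max_pool(feature_map):
    """Keep only the strongest response of a feature map."""
    return max(feature_map)


signal = [0, 0, 1, 0, 0]   # a single impulse, e.g. one deviating pin
kernel = [1, 2, 3]         # kernel size k = 3
response = conv1d(signal, kernel)
print(response, global_max_pool(response))
```

Each output value depends only on a pin and its two direct neighbors, and the three kernel weights are the only parameters, regardless of the number of pins. Deep learning libraries typically implement the unflipped variant (cross-correlation), which makes no difference for learned kernels.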

Tab. 3.2: Cross-validated classification results. 

Model     Recall(NOk)   Recall(Ok)   Accuracy 
RF        69.83 %       53.76 %      73.46 % 
GBT       73.98 %       39.70 %      43.26 % 
MLP       16.08 %       95.13 %      80.15 % 
1D-CNN    75.4 %        43.85 %      74.71 % 

The model results indicate that the overall performance of the classifiers is modest 
when it comes to overall correctness. The GBT model shows the overall best performance, 
but the difficulty in this use case is the high class imbalance rate. It appears that the 
distribution of the NOk class is not accurately represented by the classifiers or only at 
the cost of a lower recall on the Ok class. However, from a practical perspective, the 
1D-CNN can be used to reduce the testing effort of the X-ray machine, saving up to 
75.4 % of the testing volume, which corresponds to the recall of the NOk class in 
Table 3.2. Surprisingly, the experiments using feature selection methods do not 
improve the performance as we expected, even though the dimensionality is reduced. It 
appears that in this use case every individual pin has to be considered; summarizing 
the data through statistical features leads to an information loss. Lastly, since deep 
learning models usually lack interpretability, it is not clear which discriminative features 
led the classifier to draw its conclusion. By contrast, tree-based methods offer more 
interpretability. Therefore, the decision for a specific model is also a question of its 
accuracy and to some extent its interpretability by domain experts. 


Process Optimization The model deployment is achieved through organizational 
integration into the inspection planning process. As described by [580], the models 
were trained on a cloud infrastructure and stored on an edge device close to the process 
to reduce latency and bandwidth issues. Here, the inspection strategy determines the 
role of the model with respect to inspection planning and design. While an inspection, 
exclusively based on the prediction model, requires high confidence in the model and 
extremely high model accuracy to reach or exceed the level of conventional inspection 
principles, hybrid approaches seem promising for the current state of development. The 
inspection reliability is given by the combination of quality prediction and conventional 
inspection. The introduction of quality prediction in quality assurance can facilitate the 
generation of additional added value by reducing physical inspection volume without 
sacrificing inspection reliability. Two different strategies can be deduced depending on 
the trustworthiness of the model: either only the parts predicted as Ok or only the parts 
predicted as NOk are subjected to a physical inspection. As the class 
imbalance of datasets in the quality context is usually quite high, selecting only NOk 
predicted parts to undergo the physical inspection offers vastly superior potential 
savings. 


3.2.4.2 Real-Time Quality Prediction in a Hot Rolling Mill Process 

In the hot rolling mill case study, steel blocks move through a process chain to become 
bars. The process chain is shown in Figure 3.7. First, the blocks are heated for 15 hours 
in five different heating zones of a furnace. They are then rolled at the block roll and 
the first finishing roll. The rolling in the second finishing roll is optional. 


Fig. 3.7: Sensor measurements for two different blocks. ©[2016] Springer. Reprinted, with 
permission, from [654]. 


Each block usually moves over a single roll several times, where the number of rolling 
steps, each taking a few seconds, is determined. The blocks are finally cut into smaller 
bars whose quality is assessed using an ultrasonic test, several days later. Online 
measurements on how the blocks are processed are provided by sensors installed along 
the production chain. The sensors measure various physical quantities including the 
air temperature, rolling force, rolling speed, and the height of the roll, at 10 Hz. The 
ultrasonic test results indicate the amount of material containing defects for each bar. 
It is impossible to assess the physical quality of hot steel blocks or smaller bars at 
intermediate steps of the process chain. The blocks must cool down before their final 
quality can be tested in an ultrasonic test. Quality deviations are assessed on specific 
internal quality parameters such as location, manifestation, and frequency of core and 
border displays [404]. In cases where some of the blocks are, for example, wrongly 
heated, energy, material, and human workforce are wasted if they nevertheless continue 
the process chain. Therefore, the goal of the case study is to identify quality-related 
patterns in the sensor data, and to predict the final quality of steel blocks in real-time, 
hence detecting NOk blocks as early as possible. Energy savings can be achieved if, 
following the quality prediction, blocks with estimated defects are sorted out from 
the process early enough or reinserted again in previous steps. In addition, it may 
happen that the required final quality can be reached by adjusting the parameters of 
subsequent processing stations [339]. 

In the following, we describe in more detail which sensor measurements are 
recorded and how they are stored, preprocessed, and analyzed in the context of the 
given case study. See also Section 3.3, which presents a novel learning method 
inspired by exactly this use case. 


Data Acquisition and Storage During the period of one year, over one billion 
measurements from 30 different sensor types are collected during the processing of about 
10 000 steel blocks, together with the corresponding quality information. Among the 
readings are the air temperature for each furnace zone, rolling speed, force, position, 
and temperature. Domain experts consider these to be the most relevant quality-related 
parameters. For validation purposes and to guarantee the reproducibility of results, all
data has been stored in a single SQL database. A Java tool was developed for reading
and importing the raw data delivered in different files and formats. Once imported,
sensor measurements can be exported, based on filters written in SQL, to CSV files that
contain all measurements recorded by a particular sensor during the processing of a
single steel block. For the preprocessing of time series data in production environments,
a highly modular process has been developed and implemented with RapidMiner [405,
654]. The following sections provide a summary of the procedure and results already 
presented in [404, 654]. At first, all value series are cleansed by cutting away irrelevant 
parts where no processing happened, as discussed in Section 3.2.3.2. In addition, data 
measurements that lie outside meaningful ranges are marked as outliers and replaced 
by their predecessor value. 
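
The two cleansing steps can be illustrated with a minimal Python sketch. It is not the RapidMiner process used in the case study; the function name `clean_series`, the idle threshold, and the valid range are hypothetical choices made only for this illustration.

```python
from typing import List, Tuple


def clean_series(values: List[float],
                 valid_range: Tuple[float, float],
                 idle_threshold: float) -> List[float]:
    """Trim idle margins, then replace out-of-range readings by their predecessor."""
    low, high = valid_range
    # Assumption: a reading belongs to the processing phase if its magnitude
    # exceeds idle_threshold; everything before the first and after the last
    # such reading is cut away.
    active = [i for i, v in enumerate(values) if abs(v) > idle_threshold]
    if not active:
        return []
    trimmed = values[active[0]:active[-1] + 1]

    cleaned: List[float] = []
    for v in trimmed:
        if low <= v <= high:
            cleaned.append(v)
        elif cleaned:
            cleaned.append(cleaned[-1])  # outlier: repeat the predecessor value
        # Assumption: an outlier without any valid predecessor is dropped.
    return cleaned


force = [0.0, 0.0, 5.0, 6.0, 999.0, 7.0, 0.0]
print(clean_series(force, valid_range=(1.0, 100.0), idle_threshold=0.5))
# prints [5.0, 6.0, 6.0, 7.0]
```

In practice, the meaningful range would be set per sensor type together with domain experts.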

Afterwards, the time series are segmented based on background knowledge as 
described in Section 3.2.3.3. In the case of the heating furnace, for instance, the five 
different heating zones make up natural borders for the segments. Similarly, individual 
rolling steps are natural divisions for all series stemming from the three different rolls. 


Feature Extraction and Model Learning Each segment in the time series is described
by several statistics, and mapped to portions of a fixed-length vector. The 60 000 raw 
series values recorded for each steel block are aggregated to about 2000 features. The 
resulting dataset can then be fed to common feature selection and learning algorithms. 
For 470 processes, the mapping of the resulting cut bars to the steel blocks could be 
established. The feature vectors of these processes have then been used for comparing 
diverse machine learning methods: Naive Bayes, Decision Trees, k-NN, and the SVM 
[404, 654]. It has been shown that including features about the individual segments 
decreases accuracy in comparison to including global information about the value 
series and segments. Features of individual segments were therefore excluded for the 
following analysis, resulting in 218 remaining features. 
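
How segments are mapped to portions of a fixed-length vector can be sketched as follows. The choice of statistics (mean, standard deviation, minimum, maximum) and the function name `segment_features` are assumptions for illustration; the statistics actually used in the case study are described in [404, 654].

```python
from statistics import mean, pstdev
from typing import List, Sequence


def segment_features(series: Sequence[float], borders: Sequence[int]) -> List[float]:
    """Describe each segment by (mean, std, min, max) and concatenate the results.

    `borders` holds the start indices of all segments except the first one.
    Assumption: borders are strictly increasing and lie inside the series,
    so that no segment is empty.
    """
    features: List[float] = []
    cuts = [0, *borders, len(series)]
    for start, end in zip(cuts[:-1], cuts[1:]):
        segment = series[start:end]
        features.extend([mean(segment), pstdev(segment), min(segment), max(segment)])
    return features


# Two segments (e.g. two heating zones) yield a vector of 2 * 4 = 8 features.
temperature = [1.0, 2.0, 3.0, 10.0, 10.0]
vector = segment_features(temperature, borders=[3])
print(len(vector))  # prints 8
```

Because the number of segments is fixed by background knowledge, series of different lengths are mapped to vectors of the same length, which is what common learning algorithms require.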


Model Evaluation Even with extracted and selected features, none of the classifiers 
mentioned is able to reach a better prediction accuracy than the baseline, which predicts 
the majority label. For getting a better impression of the data, the feature vectors were 
mapped to a two-dimensional Self-Organizing Map (SOM) (Figure 3.8), where points
close to each other have similar features (see Section 3.2.3.4). The shading is used to 
indicate a weighted distance between the points, where lighter shades represent larger 
distances. In the SOM on the left-hand side, the points represent the feature vectors of
production processes and their colors indicate the final quality of the resulting steel
bars as Ok (red) and NOk (blue). In many cases, NOk bars are very close to Ok ones 
(see also the zoomed area in Figure 3.8), which means they have highly similar features. 
As a result, the features extracted so far do not suffice to correctly classify low and 
high-quality processes. 


[Figure 3.8 consists of two panels. Left: 40x30 SOM, final quality of steel blocks (legend: Ok, NOk). Right: 40x30 SOM, final size of steel blocks (legend: 1V, 2V, ...).]

Fig. 3.8: Similarity relationships between feature vectors. ©[2016] Springer. Reprinted, with permission, from [654].


In comparison, the SOM on the right-hand side of Figure 3.8 shows the final size of the 
produced bars. As it seems, the extracted features are highly correlated with distinct 
operational production modes for the different bar sizes. The hypothesis could be 
verified by training a decision tree on features of the first finishing roll. The accuracy as 
estimated by a 10-fold cross-validation is 90 %, while k-NN (k = 11) even achieves 97 %. 
Most important for the decision is the position of the roll (sensor 501). This indicates 
the height of the roll and naturally implies the pressure on the block. This correlates 
with the size of blocks after milling, expressed by 1V, 2V, .... Domain experts have 
verified that the results reflect the real operational model in the rolling mill. That the 
features are correlated with distinct operational modes and not the quality could mean 
that large absolute quantitative differences between the modes, i.e. the global patterns, 
overshadow local patterns. The clustering with SOMs thus gives important hints for 
improving the quality prediction. For instance, in the future, separate models for the 
modes could be trained, more scale-invariant features could be extracted, or the value 
series could be better normalized. 

As the results demonstrate, data analysis methods are capable of detecting mean- 
ingful patterns in production processes. Even though identifying the exact features 
that are relevant for the prediction task is sometimes not straightforward, as already 
discussed in Section 3.2.3.4, each insight into the data can be useful for providing new 
ideas for better feature extraction and process understanding that can be validated 
afterward by interaction with domain experts. 
3.2.5 Conclusion and Future Outlook


This contribution gave an overview of data mining for Industry 4.0. The first sections 
offer a general guideline for applications similar to ours. In Section 3.2.2, we present 
the need for intermediate quality prediction in order to reduce material and energy 
consumption and how it can be integrated into an IMPC module in order to prepare 
factories for the adequate use of data mining techniques in a quality-related context. 
Carefully going through all the steps of the data-mining process in Section 3.2.3, we 
present a large variety of methods for each step. The contribution presents two real- 
world case studies in Section 3.2.4: quality prediction in electronics manufacturing 
(Section 3.2.4.1) and in a hot rolling mill process (Section 3.2.4.2). Both use cases demon- 
strate the benefits of real-time embedded data analysis in the production chain for 
quality prediction. 

A new application of multiclass time series classification predicts the quality of a 
bolt in a real-world automotive industry use case. Shorter subsequences of time series
were determined that already sufficed to train a model achieving high recall and
F-measure (both 97 %) for almost all of the 8 classes, except for the two that were only
present in 1 % or 5 % of the data. Detecting the quality as early as possible enables
corrective actions, thus avoiding costly rework and waste of resources through further
processing of defective components. Moreover, anticipating the type of defect helps to 
estimate the reworking time to correct it, which varies from 5 seconds to 5 hours [557]. 

Another more recent study is about saving quality testing efforts in surface mount 
technology. Since it would take too much time to assess the quality at pin level, the 
quality information of the panels is aggregated at a FOV level, which corresponds to 
the aggregation level of the X-ray inspection. One FOV consists of 6 PCBs and is denoted
as "NOk" if one PCB is detected as defective, whereas it is declared as "Ok" when all 
PCBs are defect-free. In addition to excellent recall and accuracy, the trained model 
was explained using a heat map [558]. This is an important step into the interpretability 
of learned models, but further work is needed. 

Despite several success stories, there are still limitations that slow down the appli- 
cation of the developed machine learning-based solutions in the industry. The first of 
the two most important obstacles that are still in the way of machine learning adoption
in real environments is the lack of reliable labeled data in many manufacturing scenar- 
ios. Data gathering, data fusion from heterogeneous sources, and data cleaning require 
ongoing efforts. The second is that the internal organization of companies needs to 
integrate computer scientists with a qualification in machine learning into the higher 
levels of engineering departments. The social integration of employees with diverse
backgrounds, not only at board level, is a challenging task, but it is necessary
to benefit from the full potential of machine learning.
3.3 Label Proportion Learning


Marco Stolpe 
Katharina Morik 


Abstract: For the interlinked process of producing steel bars, quality information is 
given in statistical form for whole batches of blocks, i.e. we know the fraction of blocks
which had quality-related problems. This poses a new kind of problem for machine
learning, since almost all supervised learning methods assume that labels are given for
individual observations rather than for groups. The problem of learning from fractions of
labels has become known as the problem of learning from label proportions. In this 
contribution, the learning problem and existing methods for solving it are introduced, as 
well as a clustering-based method called Learning from Label Proportions by Clustering 
(LLPC) developed in the context of project B3. It is demonstrated that LLPC outperforms 
methods that were considered the state of the art at that time in terms of prediction and 
runtime performance. Moreover, the relation to resource-constrained learning settings 
such as distributed learning is shown. 


3.3.1 Introduction 


In smart manufacturing, it can be difficult to track products through the whole process 
chain. For instance, in the interlinked production process of hot rolling, the temperature 
of steel blocks is so high that they cannot be stamped or equipped with RFID chips. Once 
cut to smaller rods, tracking object identity, i.e. which rods belonged to which block in 
which customer order (or batch), can therefore become a big technical challenge. In the 
steel-rolling scenario, quality labels are usually given as percentages for whole batches, 
but not for individual blocks. This is similar to other industries, in which due to cost 
reasons only the quality of a small sample is checked. Depending on the estimated 
fraction of faulty products, either the whole batch needs to be thrown away, checked 
again, or it is accepted that a very small fraction of faulty products is delivered to the 
customer. An interesting question is whether we can check only a sample, but nevertheless
derive information about the properties of individual products. Moreover, can we sort 
out some rods already during the production process itself, before the quality check? 
To put it in more general terms: can we derive a model that assigns correct labels to 
individual products based on their properties, if we are only given the proportions of 
each label for different batches of products? 

The aforementioned problem of learning from label proportions (LLP) has applications
not only in industry, but also in areas as diverse as privacy-preserving
data mining, election forecasting, bank customer classification, bankruptcy prediction,
and marine litter beaching prediction. Especially relevant for us is its relationship
to resource-constrained distributed learning settings. In distributed machine learning,
the overall data can be partitioned across locations in two ways, horizontally and vertically.
Horizontally partitioned data consist of separate sets of observations with all their features, for
instance, the sales of different shops each offering the same items. By contrast, vertical
partitioning means that each location senses a different subset of features. This is a
common scenario in the Internet of Things [653]. Our steel rolling scenario is of this
kind. It also appears, for instance, in traffic prediction problems (see also Section 4.1).
Here, we might like to reduce communication costs by transmitting only aggregated
label information between nodes. Again, the question is whether we can learn a model that is
sufficiently accurate in assigning class labels to individual instances, based only on
aggregated label information.
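
The difference between the two partitioning schemes can be made concrete with a small Python sketch. The feature names and the helper functions `horizontal_partition` and `vertical_partition` are hypothetical and serve only as an illustration.

```python
from typing import Dict, List, Sequence

Row = Dict[str, float]


def horizontal_partition(data: List[Row], n_nodes: int) -> List[List[Row]]:
    """Each node stores complete rows for a subset of the observations."""
    return [data[i::n_nodes] for i in range(n_nodes)]


def vertical_partition(data: List[Row],
                       feature_groups: Sequence[Sequence[str]]) -> List[List[Row]]:
    """Each node stores all observations, but only its own subset of features."""
    return [[{f: row[f] for f in group} for row in data] for group in feature_groups]


# Toy data: one row per observation (hypothetical sensor features).
data = [
    {"temperature": 1210.0, "force": 3.1, "speed": 0.8},
    {"temperature": 1195.0, "force": 2.9, "speed": 0.9},
    {"temperature": 1230.0, "force": 3.4, "speed": 0.7},
    {"temperature": 1188.0, "force": 3.0, "speed": 0.8},
]

by_rows = horizontal_partition(data, n_nodes=2)
by_columns = vertical_partition(data, [["temperature"], ["force", "speed"]])
print(len(by_rows[0]), len(by_columns[0]))  # prints 2 4
```

In the vertically partitioned case, no single node sees the complete feature vector of an observation, which is why transmitting only aggregated label information between nodes is attractive.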

The problem of LLP not only deviates from that of supervised learning, where we 
learn from individually labeled training examples, but also from many other learning 
settings known in machine learning and data mining. It is different from semi-supervised 
learning [120], where we are given at least some examples that are labeled. It is not 
strictly unsupervised learning, since we are given at least some additional information 
about labels. It is different from anomaly and outlier detection, where we might know 
about observations that belong to a normal class. It comes close to multiple instance 
learning [731], where whole bags of observations are either labeled as positive or neg- 
ative. However, LLP is not exactly the same problem, since we are not given binary 
information on each bag, but real-valued statistical information about the labels in 
each bag. 

In Section 3.3.2, the task of LLP is defined more formally. Then, Section 3.3.3 gives 
an overview of related work. In Section 3.3.4, we discuss the difficulty of the problem 
from a Bayesian perspective. Afterwards, Section 3.3.5 defines loss functions for the 
scenario. Section 3.3.6 introduces a clustering approach and variants that minimize
the aforementioned loss functions. In Section 3.3.7, we compare the algorithm’s prediction
performance and runtime to other existing methods. Finally, Section 3.3.8 concludes 
and gives a short summary. 


3.3.2 The Problem of Learning from Label Proportions 


To the best of our knowledge, Musicant et al. were the first to formally define both
the classification and regression tasks of the problem [466]. We extend their problem
definition to multi-class problems and relate it to the unknown joint distribution P(X, Y)
from which all observations and labels are drawn.


Definition 1 (Learning from label proportions (LLP)). Let $\mathcal{X}$ be an instance space and
$\mathcal{Y}$ be a space of categorical class labels $Y_1, \ldots, Y_l$. Let P(X, Y) be an unknown joint
distribution on the instances and class labels. In the setting of learning from label
proportions, we are given a sample of N unlabeled observations $X = (x_1, \ldots, x_N)$ with
$X \subset \mathcal{X}$, drawn i.i.d. from P(X). Let $y_i \in Y$ be the individual class label for observation $x_i$,
where $Y \subset \mathcal{Y}$. The individual labels are unknown. Instead, we are given a partitioning
of X into h disjoint bags $B_1, \ldots, B_h$ and for each bag $B_u$ and label $Y_v$, we are given the
proportion $\pi_{uv} \in [0, 1]$ of observations with that label in bag $B_u$. Based only on this
information, we seek a function (model) $f : \mathcal{X} \to \mathcal{Y}$ that predicts the label $y \in \mathcal{Y}$ for an
observation $x \in \mathcal{X}$ drawn i.i.d. from P, such that the expected risk

$$R_{\mathrm{exp}} = \int \ell(y, f(x)) \, dP(x, y)$$

is minimized. Here, $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ is a convex loss function which measures the cost
of assigning the wrong label to individual observations.


[Figure 3.9 shows the following example with $Y = \{0, 1\}$ and N = 9.

Labeled examples (unknown):
$B_1 = \{(x_1, 1), (x_3, 1), (x_7, 0)\}$
$B_2 = \{(x_2, 0), (x_4, 0), (x_5, 1), (x_6, 1)\}$
$B_3 = \{(x_8, 0), (x_9, 0)\}$

Unlabeled examples (known):
$B_1 = \{x_1, x_3, x_7\}$
$B_2 = \{x_2, x_4, x_5, x_6\}$
$B_3 = \{x_8, x_9\}$

Label proportions (known):

         y = 0   y = 1   bag size
B_1      0.33    0.67    |B_1| = 3
B_2      0.50    0.50    |B_2| = 4
B_3      1.00    0.00    |B_3| = 2
overall  0.56    0.44              ]

Fig. 3.9: Example for given bags of observations, a label proportion matrix, and related notations


The given label proportions $\pi_{uv}$ can more conveniently be written as an $h \times l$ matrix
$\Pi = (\pi_{uv})$, where the values in a row $\Pi_{u,\cdot} = (\pi_{u1}, \ldots, \pi_{ul})$ sum up to one. The frequency
count $\mu_{uv}$ of observations with label $Y_v \in \mathcal{Y}$ in bag $B_u$ can easily be reconstructed by
multiplying the label proportion $\pi_{uv}$ with bag size $|B_u|$.

The proportion $\eta(\Pi, Y_v)$ of label $Y_v$ over the whole sample can then be calculated
from $\Pi$ as the sum of the frequency counts over all bags u, divided by the total number of
observations N:

$$\eta(\Pi, Y_v) = \frac{1}{N} \sum_{u=1}^{h} \mu_{uv}. \tag{3.2}$$

Figure 3.9 gives an example of the notations previously introduced, the division of
observations into disjoint bags, and the label proportion matrix as derived from the
original (now unknown) labels.
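
As a worked example, the following Python sketch recomputes the label proportion matrix and the overall label proportions of Equation (3.2) for the bags of Figure 3.9. The function names are hypothetical. Note that the individual labels are used here only to construct the example; in the LLP setting, a learner would be given just the matrix and the bag sizes.

```python
from collections import Counter
from typing import List, Sequence


def proportion_matrix(bag_labels: Sequence[Sequence[int]],
                      classes: Sequence[int]) -> List[List[float]]:
    """Row u holds the proportions pi_uv of each class in bag B_u."""
    matrix = []
    for labels in bag_labels:
        counts = Counter(labels)
        matrix.append([counts[c] / len(labels) for c in classes])
    return matrix


def overall_proportions(pi: Sequence[Sequence[float]],
                        bag_sizes: Sequence[int]) -> List[float]:
    """Equation (3.2): eta(Pi, Y_v) = (1/N) * sum_u |B_u| * pi_uv."""
    n = sum(bag_sizes)
    n_classes = len(pi[0])
    return [sum(size * row[v] for row, size in zip(pi, bag_sizes)) / n
            for v in range(n_classes)]


# Labels of the three bags in Figure 3.9 (unknown to the learner).
bags = [[1, 1, 0], [0, 0, 1, 1], [0, 0]]
pi = proportion_matrix(bags, classes=[0, 1])
eta = overall_proportions(pi, [len(b) for b in bags])
print([round(p, 2) for p in eta])  # prints [0.56, 0.44]
```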
3.3.3 Related Work


When we started working on LLP in 2010, only a few publications on the topic were
available. In the following, we discuss related work insofar as it has been relevant
for comparison with the clustering approach that we published at that time [651, 652]
and that is presented in this contribution. Since its publication, our
approach has been cited more than 70 times according to Google Scholar and even 
more works on the problem setting have appeared. For instance, different clustering 
methods have been applied [154]. Fish and Reyzin cast light on the theoretical properties 
of the problem in the context of probably approximately correct (PAC) learning [209] 
(for an easy introduction into PAC learning see Mitchell [447]). Saket et al. present an 
approach that does not rely on the underlying distributions of the bags, and give some 
guarantees for any learner [571]. However, they do not cover multiclass learning as we 
do here. Kobayashi and colleagues show estimation bounds for multiclass LLP [333]. 
A probabilistic method for LLP has been developed to estimate the individual votes in
US presidential elections [660]. New applications have also been addressed, such as bank
customer classification [513], marine litter beaching prediction [273], and bankruptcy
prediction [133].


Related Semi-Supervised Methods There are some approaches that seem similar 
to the scenario of LLP, but are actually semi-supervised learning tasks. For instance, 
Dara et al. first cluster the given data with SOMs and then label the resulting clus- 
ters [159]. However, labeled observations are given, which are usually not available 
when learning from label proportions. Demiriz et al. adapt the k-means optimization 
problem to respect labeled data [167]. Again, this is a semi-supervised setting, with 
labeled observations. 


Basic Methods To the best of our knowledge, Kueck and Freitas were the first to
introduce the problem of LLP by proposing a probabilistic model based on group
statistics that is trained by an efficient Markov-Chain-Monte-Carlo (MCMC) sampling
algorithm [357]. Musicant et al. were the first to define the problem of learning from
aggregate values for regression and classification tasks in a more formal way [466]. They
modify well-known methods such as k-NN [13], backpropagation neural networks [447] 
and the linear SVM [700] to respect the given label proportions. Their experimental 
results focus on regression tasks, while we are mainly interested in classification. 


Mean Map and Laplacian Mean Map The Mean Map method we use for comparisons 
in Section 3.3.7 has been proposed by Quadrianto et al. [514]. It estimates the conditional
class probability $P(Y \mid X, \theta)$ by conditional exponential models, using a joint feature
map $\phi$ that maps observations and labels into a new feature space. The parameters
$\theta$ can be estimated by solving a convex optimization problem that maximizes the conditional
log-likelihood. The conditional log-likelihood in turn can be expressed in terms of
the so-called mean operator, which can be expanded into bag-wise and label-wise
components. The unknowns in this formulation can then be found by solving a system
of linear equations, without knowing the individual labels. This is possible by making a
homogeneity assumption, which states the conditional independence of feature vectors
from bags, given the label. Once the mean operator is estimated, the parameter vector
$\theta$ can be derived by standard methods for maximum likelihood estimation. It is shown
that Mean Map outperforms kernel density estimation, discriminative sorting, and 
MCMC [357]. 

Patrini et al. relax Mean Map’s restrictive homogeneity assumption such that when- 
ever bags are similar to each other, it is assumed that also their feature vectors are 
similarly distributed, given the label [494]. The relaxed assumption is encoded into a 
regularized least-squares minimization problem, which can be rewritten in matrix form 
by the Laplacian of a symmetric matrix whose entries consist of the similarities between 
bags. The solution to the stated optimization problem can then be obtained in closed 
form. On ten datasets from the UCI standard repository [32], Laplacian Mean Map (LMM) and
its alternating variant AMM outperform Mean Map, Invcal, and the ∝SVM in terms of
prediction performance and runtime. However, since LMM is not kernelized and can only
find linear decision boundaries, its results cannot be directly compared with those of the
non-linear clustering algorithm introduced in Section 3.3.6.


Inverse Calibration Rüping proposes the Inverse Calibration (Invcal) method [552].
The regression SVM (SVR) is converted into a probabilistic classifier by applying a
scaling function $\sigma$ to the outputs. According to the author, it is sufficient that the
predictions of the classifier approximate the given label proportions well on average, 
for each bag. These constraints are integrated as auxiliary conditions into the standard 
SVR optimization problem. As a large margin method, the formulation allows for the 
reduction of model complexity, while the class probability estimates for each bag are 
kept close to the given label proportions for each bag, up to some maximum tolerable 
error. The primal problem can be transformed into its dual, and then solved with a 
standard solver for quadratic optimization. It is shown empirically over twelve standard 
datasets from the UCI repository that Invcal significantly outperforms Mean Map in 
terms of prediction accuracy. 


∝SVM Invcal treats the mean of each bag as some kind of super-instance, and gives
each bag a regression label that corresponds to the label proportions. Instead, the ∝SVM
proposed by Yu et al. explicitly models the labels of individual observations [751]. The 
label proportions, as calculated from labels assigned to individual observations, should 
match the given label proportions as closely as possible. This criterion is encoded 
as an additional term into the primal problem of the standard SVM. The task is to 
find a vector of labels such that the loss over label proportions and the standard loss 
over individual observations are minimized. This ensures that observations lying on
the same side of the hyperplane will be assigned the same label, depending on the 
particular values of the trade-off parameters C and Cx. Although the formulation seems 
intuitive, the optimization problem is an NP-hard non-convex integer programming
problem. The authors propose two different efficient algorithms for solving it, one 
based on an alternating optimization strategy, and another based on convex relaxation. 
In experiments, the ∝SVM outperformed Mean Map and Invcal in terms of accuracy on
several datasets from the UCI standard repository. However, the authors do not report
which significance test they used. It should be noted that in the work by Patrini et al.
results are not always in favor of the ∝SVM in comparison to Invcal, even on the same
datasets [494]. Moreover, Mean Map outperformed the ∝SVM in many cases, while
Invcal outperformed Mean Map in the work by Rüping [552].


AOC Kernel k-Means AOC kernel k-means (AOC for Aggregate Output Classification),
introduced by Chen et al. and called AOC-KK in the following, clusters the observations
such that clusters correspond to classes, and the assignment of observations to clusters
(classes) matches the given label proportions [131]. The authors present variants of
k-means and kernel k-means [174], which is a kernelized version of the original k-means
algorithm. Here, the cluster centers can no longer be written in explicit form, but have
to be expressed in terms of a kernel function induced by some feature map $\phi$.

In the objective function formulated, the first term is the same as in the original 
optimization problem of kernel k-means, while the second term measures the deviance 
between the given label proportions and those that would result from the current 
assignment of observations to clusters (classes). In that way, the authors try to find
a good clustering, i.e. an assignment of observations to clusters (classes), such that
the within-cluster scatter is minimized while the given label
proportions are matched as well as possible. The trade-off between the two criteria
can be controlled by a parameter $\lambda$. As standard tools for convex optimization cannot
be used, the authors propose an alternating updating algorithm based on expectation
maximization (EM) [168]. On two datasets from the UCI standard repository, AOC-KK
outperforms the k-NN and neural network variants of Musicant et al. [466] in terms of accuracy.
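
The structure of such a two-term objective can be sketched in Python. This is not the formulation of Chen et al. [131]: for simplicity, the sketch uses plain Euclidean distances instead of a kernel and assumes a quadratic deviation between proportions. All function names and the parameter name `lam` are hypothetical.

```python
from typing import List, Sequence


def within_cluster_scatter(points: Sequence[Sequence[float]],
                           assignment: Sequence[int],
                           n_classes: int) -> float:
    """Sum of squared Euclidean distances of all points to their cluster mean."""
    total = 0.0
    for v in range(n_classes):
        members = [p for p, a in zip(points, assignment) if a == v]
        if not members:
            continue
        center = [sum(coords) / len(members) for coords in zip(*members)]
        total += sum(sum((x - m) ** 2 for x, m in zip(p, center)) for p in members)
    return total


def proportion_deviance(bags: Sequence[Sequence[int]],
                        assignment: Sequence[int],
                        given: Sequence[Sequence[float]],
                        n_classes: int) -> float:
    """Squared deviation between given and induced label proportions (assumed quadratic)."""
    total = 0.0
    for bag, target in zip(bags, given):
        for v in range(n_classes):
            induced = sum(1 for i in bag if assignment[i] == v) / len(bag)
            total += (target[v] - induced) ** 2
    return total


def objective(points, bags, assignment, given, n_classes, lam):
    """Scatter term plus lam times the proportion term."""
    return (within_cluster_scatter(points, assignment, n_classes)
            + lam * proportion_deviance(bags, assignment, given, n_classes))


# Four one-dimensional points; each bag (given by point indices) is half class 0, half class 1.
points = [[0.0], [0.2], [1.0], [1.2]]
bags = [[0, 2], [1, 3]]
given = [[0.5, 0.5], [0.5, 0.5]]
good = [0, 0, 1, 1]  # compact clusters that also match the proportions
bad = [0, 1, 0, 1]   # scattered clusters that violate the proportions
print(objective(points, bags, good, given, 2, lam=1.0)
      < objective(points, bags, bad, given, 2, lam=1.0))  # prints True
```

An alternating optimization would repeatedly update the assignment and the cluster centers to decrease this value; that search is omitted here.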

Although AOC-KK shares similarities with the clustering approach LLPC developed 
in Section 3.3.6, there are some fundamental differences. The first is that AOC-KK 
restricts the number of clusters to the number of classes, while LLPC allows for classes 
being represented by more than one cluster. This allows for a better control of bias 
vs. variance, by changing k. Another difference is that AOC-KK combines the loss over 
label proportions with the original kernel k-means objective in the same objective 
function, while LLPC first clusters observations as usual, and then tries to find a good 
assignment of labels to the resulting clusters. LLPC thus has the advantage that it can 
be used with arbitrary partitional clustering algorithms, while AOC-KK works only 
with k-means and kernel k-means. LLPC is compared to AOC-KK with a quadratic loss
function in Section 3.3.7.


Theoretical Results In an unpublished work, Yu et al. cast LLP into the PAC learn- 
ability framework [750]. The authors prove that under certain conditions, the labels of 
individual observations can be predicted well when the label proportions per bag can 
be predicted well. The generalization error over bag proportions in turn can be bounded 
by the empirical proportion error if the number of bags is large in relation to the
Vapnik-Chervonenkis (VC) dimension of the underlying hypothesis class H.¹ The authors
further show that the probability for classifying instances correctly increases with the 
purity of bags, i.e. if many instances per bag belong to the same class. In extreme cases, 
where all label proportions are equal (i.e. they are the least pure), it can happen instead 
that a hypothesis achieves zero bag proportion error, but nevertheless classifies all 
instances incorrectly. The true error can be even further bounded by making additional 
assumptions on the distribution of bags [750]. 
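
The extreme case can be reproduced with a tiny constructed example in Python. The data and helper names are hypothetical. Two bags each contain one instance per class, and a hypothesis that flips every label reproduces the bag proportions exactly while misclassifying every instance.

```python
from typing import List, Sequence


def class1_proportions(labels: Sequence[int],
                       bags: Sequence[Sequence[int]]) -> List[float]:
    """Proportion of class 1 in each bag; bags are given as lists of instance indices."""
    return [sum(labels[i] for i in bag) / len(bag) for bag in bags]


def accuracy(predicted: Sequence[int], truth: Sequence[int]) -> float:
    """Fraction of instances whose predicted label equals the true label."""
    return sum(p == t for p, t in zip(predicted, truth)) / len(truth)


true_labels = [0, 1, 0, 1]
bags = [[0, 1], [2, 3]]                 # every bag is 50 % class 0 and 50 % class 1
flipped = [1 - y for y in true_labels]  # hypothesis that inverts every label

print(class1_proportions(true_labels, bags))  # prints [0.5, 0.5]
print(class1_proportions(flipped, bags))      # prints [0.5, 0.5]
print(accuracy(flipped, true_labels))         # prints 0.0
```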

The aforementioned findings imply that special care must be taken when compar- 
ing the performance of label proportion learning methods. For instance, it must be 
ensured that algorithms are trained and validated on the exact same data splits. Moreover,
since the individual bag distributions can play a big role in the performance of
algorithms, performance should be assessed over diverse datasets and results
need to be tested for their significance, even more so than with traditional supervised
methods. That one method outperforms another then does not mean that it performs better
in an absolute sense, under all circumstances, but only that it does so on average.


Other Works Hernandez-Gonzalez et al. apply a structural EM strategy to learn 
Bayesian network classifiers from label proportions [274]. They compare their method 
to Mean Map and report a lower error rate of their method for four of seven domains.
However, the significance of results is not reported. Fan et al. introduced a generative 
classifier called DNLP, which learns from label proportions by following a deep belief 
network approach [198]. The authors compare their method to Mean Map and Invcal on 
several standard datasets from the UCI repository. In terms of prediction performance, 
they report no significant differences. However, the runtime of DNLP is much lower 
than that of Mean Map and Invcal. Fan and Taylor combine convolutional neural net- 
works (CNN) with probabilistic graphical models trained by an EM approach to learn 
from label proportions in the context of ice and open water classification from image 
data [199]. Their algorithm shows good performance in the context of the mentioned
application, but is not evaluated on other domains.


¹ For an easy introduction, we recommend Mitchell [447], who explains PAC learning and the VC dimension.
3.3.4 Difficulty of the Problem


To get a better idea of the difficulty of the problem, we discuss the problem of
LLP from a more Bayesian perspective and relate it to different kinds of better-known
learning tasks such as supervised, semi-supervised, and unsupervised learning. 

For the supervised learning scenario, we can construct an optimal classifier, called
the optimal Bayes classifier, if the distribution P(X, Y) is known. From a Bayesian
perspective, a prediction model can be obtained from estimating the conditional class
density P(Y | X). Applying Bayes' theorem, one recognizes that P(Y | X) may also be
estimated from other unknown densities, namely the class-conditional density P(X | Y) and
the class prior density P(Y):

$$P(Y \mid X) = \frac{P(X \mid Y) \cdot P(Y)}{P(X)} \tag{3.3}$$

Here, P(X) doesn’t necessarily need to be known or estimated, since it can be calculated
from P(X | Y) and P(Y). P(Y) may be estimated directly from the data, if the number of
data points is high enough. Moreover, if the joint distribution P(X, Y) is known, as is
assumed by the optimal Bayes classifier, all other quantities can be derived from it. For
a given observation $x \in \mathcal{X}$ to classify, the optimal Bayes classifier would predict the
most probable class, which is also known as the MAP criterion. Here, optimal means
that the Bayes classifier is the best classifier over all possible classifiers for the given
data.
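
For a discrete toy distribution, the MAP criterion and the role of P(X) can be illustrated with a short Python sketch. The probabilities, the feature values "low" and "high", and the function names are invented for this illustration only.

```python
from typing import Dict


def map_predict(x: str,
                prior: Dict[str, float],
                likelihood: Dict[str, Dict[str, float]]) -> str:
    """Return argmax_y P(x | y) * P(y); P(x) is the same for all y and can be ignored."""
    return max(prior, key=lambda y: likelihood[y].get(x, 0.0) * prior[y])


def posterior(x: str,
              prior: Dict[str, float],
              likelihood: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Bayes' theorem, with P(x) computed as sum_y P(x | y) * P(y).

    Assumption: x has non-zero probability under at least one class.
    """
    joint = {y: likelihood[y].get(x, 0.0) * prior[y] for y in prior}
    evidence = sum(joint.values())
    return {y: j / evidence for y, j in joint.items()}


# Hypothetical class priors P(Y) and class-conditional probabilities P(X | Y).
prior = {"Ok": 0.8, "NOk": 0.2}
likelihood = {
    "Ok": {"low": 0.9, "high": 0.1},
    "NOk": {"low": 0.3, "high": 0.7},
}

print(map_predict("high", prior, likelihood))  # prints NOk
print(map_predict("low", prior, likelihood))   # prints Ok
```

Although "Ok" is the more frequent class a priori, the observation "high" is so much more likely under "NOk" that the posterior favors "NOk" (0.14 versus 0.08 before normalization).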


Best Case With respect to LLP, the class prior $P(Y_v)$ for label $Y_v$ can be estimated as
$\eta(\Pi, Y_v)$, the proportion of $Y_v$ over the whole sample. This is done, for instance, by the Mean Map method.
Finding a good estimate for P(X | Y), however, is at least as difficult as in the supervised
scenario and equates to it if each bag $B_u$ only contains observations from a single class
and at least l bags contain observations from different classes. This scenario may be
called the best case, since in LLP, usually less information about individual labels is 
given. Our intuition matches the findings of Yu et al. where it has been proven that the 
probability of classifying instances correctly increases with the purity of the bags [750]. 
When each bag only contains examples from the same class, each bag is as pure as 
possible. 


Worst Case In the worst case, all label proportions $\pi_{uv}$ in matrix $\Pi$ are equal, i.e. least
pure, and labels can only be guessed correctly with probability 1/l. If the sample size is
large enough, the worst case can only occur if all class priors $P(Y_v)$ are also equal, and
the label does not depend on the bag, i.e. P(Y | u) = P(Y). Otherwise, we can estimate
P(Y) from the data and at least predict the class that has the highest probability to occur
(i.e. the majority class). In this case, the probability of predicting the correct label
would be higher than 1/l.
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The worst case can only occur with large amounts of data, where the label proportions, 
given as relative frequencies, will approach the true probabilities of classes in each bag, 
or in the context of privacy-preserving data mining, where we have full control over the 
formation of bags and want to make the problem as difficult as possible. In general, a 
small number of large sized bags can make the problem more difficult, as Yu et al. have 
shown [750]. For smaller random samples, we may usually expect slight deviations 
of the proportions in matrix IT, which may help with making a correct decision about 
class labels. For instance, when randomly uniformly sampling the 50 observations 
per class from the well-known Iris dataset into bags, in many cases the clustering 
approach developed in Section 3.3.6 classifies at least 96 % of the observations correctly 
on average. 


Average Case In cases where observations have been sampled more or less randomly
into bags, a first intuition might be that bags that are more “pure”, i.e. that contain
more instances of the same class, provide more information. Yu et al. also show that
the probability of classifying individual instances correctly increases with the purity of
bags [750]. However, only an upper bound is shown for the probability of misclassifying
a fraction of individual observations. In practice, cases may occur where we
perform well, despite label proportion matrix Π having high entropy, or badly, despite
Π having low entropy.

For instance, bags with low information content in terms of labels may nevertheless 
provide information about the underlying distribution of observations, P(X). Getting 
more information about P(X) by taking unlabeled observations into account as well 
can increase prediction performance when only a few labeled examples are given [120]. 
Conversely, even if a bag has high information content in terms of labels, learning might 
not profit from it if the sample doesn’t represent the underlying data distribution. 

For practical cases, it is therefore hard to find a measure of problem difficulty. While
it is easy to measure the entropy of Π, it is difficult to measure how well bags reflect
the overall data distribution given a concrete sample, without knowing the underlying
data distribution, which is the crux of learning.
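How the entropy of a label proportion matrix might be measured is sketched below. The text does not fix a particular definition, so this sketch assumes the mean Shannon entropy (in bits) of the rows of Π; the function names are hypothetical.

```python
import math


def row_entropy(row):
    """Shannon entropy (in bits) of the label proportions of a single bag."""
    return -sum(p * math.log2(p) for p in row if p > 0)


def mean_entropy(pi):
    """Mean row entropy of a label proportion matrix, given as a list of rows.

    Assumption of this sketch: all bags are weighted equally.
    """
    return sum(row_entropy(row) for row in pi) / len(pi)
```

A pure bag such as `[1.0, 0.0]` has entropy 0, whereas the least pure binary bag `[0.5, 0.5]` has entropy 1 bit. As discussed above, such a measure only captures the information content of the labels, not how well the bags represent P(X).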


3.3.5 Loss and Risk 


There are different possible ways to define loss functions over label proportions. First,
we define measures of the quadratic deviation between the label proportions derived
from a previously trained prediction model f and the given label proportions.
Applying the trained model to a set of observations x_i ∈ X, the resulting label
proportions can be calculated by counting the number of observations x_i with f(x_i) = Y_v in
each bag for each label Y_v ∈ Y and dividing such counts by the size of their respective
bag. This leads to a new matrix Γ_f containing the model-based label proportions:

\[ \Gamma_f = (\gamma^f_{uv}), \quad \gamma^f_{uv} = \frac{1}{|B_u|} \sum_{x \in B_u} I(f(x), Y_v), \quad I(f(x), Y_v) = \begin{cases} 1 & : f(x) = Y_v \\ 0 & : f(x) \neq Y_v \end{cases} \tag{3.4} \]
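The counting procedure described above can be sketched in a few lines. The representation of bags as lists of observations and the threshold model in the usage example are assumptions made for illustration only.

```python
def model_proportions(bags, labels, f):
    """Return the model-based label proportions as a list of rows.

    gamma[u][v] is the fraction of observations in bag u that the
    model f assigns to label labels[v].
    """
    gamma = []
    for bag in bags:
        row = [sum(1 for x in bag if f(x) == y) / len(bag) for y in labels]
        gamma.append(row)
    return gamma


# Toy usage: one-dimensional observations and a made-up threshold model.
bags = [[0.1, 0.2, 0.9, 0.8], [0.3, 0.7]]
f = lambda x: "pos" if x > 0.5 else "neg"
gamma = model_proportions(bags, ["neg", "pos"], f)
```

In this toy example, both bags end up with proportions of one half for each label, since the model assigns half of the observations in each bag to "pos".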


Similar to when defining a loss function for individual observations, it is now possible
to define a loss function over individual matrix entries by, say, taking as loss the squared
error (π_uv − γ^f_uv)². The total deviance between Π and Γ_f can then be defined as the
average squared error over all matrix entries:

\[ \ell_{MSE}(\Pi, \Gamma_f) = \frac{1}{hl} \sum_{u=1}^{h} \sum_{v=1}^{l} (\pi_{uv} - \gamma^f_{uv})^2 \tag{3.5} \]
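As a small worked example of the average squared error, the following sketch evaluates the loss for two matrices given as lists of rows. The function name `mse_loss` and the example values are hypothetical.

```python
def mse_loss(pi, gamma):
    """Average squared error over all entries of two h x l proportion matrices."""
    h, l = len(pi), len(pi[0])
    total = sum((pi[u][v] - gamma[u][v]) ** 2 for u in range(h) for v in range(l))
    return total / (h * l)
```

For the given proportions `[[1.0, 0.0], [0.5, 0.5]]` and model-based proportions `[[0.5, 0.5], [0.5, 0.5]]`, the squared errors are 0.25, 0.25, 0 and 0, so the loss is 0.5 / 4 = 0.125.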


The average squared error ℓ_MSE does not take into account the relative group and class
sizes; nor can it distinguish between two hypotheses f_1 and f_2 whose total error sum
over all matrix entries is the same. In practice, it can make sense to measure the error
between Π and Γ_f by ℓ_Π, which we define as the geometric mean of two different error
measures ℓ_weight and ℓ_prior, which deal with the previously mentioned disadvantages:


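\[ \ell_\Pi(\Pi, \Gamma_f) = \sqrt{\ell_{weight}(\Pi, \Gamma_f) \cdot \ell_{prior}(\Pi, \Gamma_f)} \quad \text{with} \tag{3.6} \]
\[ \ell_{weight}(\Pi, \Gamma_f) = \frac{1}{l} \sum_{u=1}^{h} \sum_{v=1}^{l} \eta(\Pi, Y_v)\, \frac{|B_u|}{N}\, (\pi_{uv} - \gamma^f_{uv})^2 \quad \text{and} \tag{3.7} \]
\[ \ell_{prior}(\Pi, \Gamma_f) = \frac{1}{l} \sum_{v=1}^{l} \left( \eta(\Pi, Y_v) - \eta(\Gamma_f, Y_v) \right)^2 \tag{3.8} \]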


ℓ_weight weights the squared error of individual matrix entries by their relative group and
class size. ℓ_prior measures how well a chosen hypothesis matches the class priors, as
estimated by η(Π, Y_v). The choice to include the prior in the loss function has been made
based on empirical evaluations and a close examination of the label proportion matrices
which have led to misclassifications. What we have observed in our experiments
now has a theoretical justification. As shown by Yu et al., whenever a hypothesis
matches the class priors and observations in bags are distributed i.i.d., the probability
of misclassifying a fraction of individual observations is bounded [750].
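The combined error measure can be sketched as follows. Two points are assumptions of this sketch rather than statements from the text: η is computed as the bag-size-weighted average of a matrix column, and ℓ_weight multiplies each squared entry error by the relative bag size |B_u|/N and the class proportion η(Π, Y_v). All function names are hypothetical.

```python
import math


def eta(props, bag_sizes, v):
    """Overall proportion of label v: bag-size-weighted average of column v (assumption)."""
    n = sum(bag_sizes)
    return sum(size * row[v] for size, row in zip(bag_sizes, props)) / n


def prior_loss(pi, gamma, bag_sizes):
    """Mean squared difference between given and model-based class priors."""
    l = len(pi[0])
    return sum((eta(pi, bag_sizes, v) - eta(gamma, bag_sizes, v)) ** 2
               for v in range(l)) / l


def weight_loss(pi, gamma, bag_sizes):
    """Squared entry errors weighted by relative bag and class size (assumed weighting)."""
    n, l = sum(bag_sizes), len(pi[0])
    total = 0.0
    for u, size in enumerate(bag_sizes):
        for v in range(l):
            total += eta(pi, bag_sizes, v) * (size / n) * (pi[u][v] - gamma[u][v]) ** 2
    return total / l


def combined_loss(pi, gamma, bag_sizes):
    """Geometric mean of the weighted entry error and the prior error."""
    return math.sqrt(weight_loss(pi, gamma, bag_sizes) * prior_loss(pi, gamma, bag_sizes))
```

One property of the geometric mean is worth noting: if either factor is zero, the combined loss is zero as well. For example, two bags of equal size whose proportions are swapped by the model match the class priors exactly, although every matrix entry is wrong.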

Moreover, if in addition to the label proportions, the true labels y(x) of a subset
T ⊂ X of observations x ∈ T are given, error criterion (Equation 3.6) can be easily
extended to include the average loss ℓ_T over these labeled training examples:

\[ \ell^T_\Pi = \sqrt[3]{\ell_{weight} \cdot \ell_{prior} \cdot \ell_T} \quad \text{with} \quad \ell_T = \frac{1}{|T|} \sum_{x \in T} \ell(y(x), f(x)) \tag{3.9} \]


Algorithms that optimize over ℓ_Π can thereby also easily consider labeled observations
in addition to the given label proportions.


3.3.6 Learning from Label Proportions by Clustering


The goal in LLP is to find a function f that predicts the proportions of previously unseen 
bags as well as possible, which in turn bounds the risk of misclassifying individual 
observations, as shown by Yu et al. [750]. The authors pose this problem in terms of 
empirical risk minimization. However, if we allow for arbitrarily complex hypotheses,
we can always match the given label proportions. In particular, if we tried all
possible labelings of observations exhaustively, we would always find a set of labelings
that minimizes one of the previously introduced loss functions. We would expect only a
few of such labelings to also minimize the empirical loss over individual observations,
i.e. we somehow need to control the capacity of our hypothesis class.

The particular LLP approach proposed in the following is based on the assumption 
that observations lying close together in regions of the input space also share the same 
class label. It first forms clusters of similar observations using an arbitrary partitional 
clustering algorithm and respective distance measure. Instead of trying all possible 
labelings of observations, the algorithm heuristically tries different labelings of clusters, 
such that a loss function over label proportions is minimized. The capacity of the 
hypothesis space can thus be controlled by varying the number of clusters k. A small 
number of clusters leads to high bias, but low variance. A larger number of clusters 
allows for ever smaller divisions of sample X, and therefore leads to low bias, but high 
variance. 

The assumption that clusters represent classes is not necessarily correct. Hastie et al.
demonstrate that especially the weighting of features can have an enormous influence
on clustering results [259]. In fact, one advantage of supervised methods over
unsupervised ones is that they can determine the relevance of features in relation to the
target variable. We therefore allow for a certain flexibility in distance measures. Such
measures should respect weights w_j ∈ [0, 1] for each feature A_j, as given by a
vector w = (w_1, ..., w_d). Usually, such weights are specified by a domain expert. In the
clustering approach introduced in the following, however, the relevance weights can
be approximated automatically by an evolutionary strategy, based on one of the loss
functions defined in Section 3.3.5 (or other loss functions for LLP).

In Section 3.3.6.1, the accompanying optimization problem is stated. Then, in
Section 3.3.6.2, an approach for solving it is described. The algorithm can be used with 
different labeling strategies which are presented in Section 3.3.6.3. The approach’s 
runtime is analysed in Section 3.3.6.4, while Section 3.3.6.5 explains how to classify 
new examples, based on a set of labeled clusters. 


3.3.6.1 Optimization Problem 
Let the vector λ_C = (λ_1, ..., λ_k) with λ_i ∈ Y represent a labeling for a clustering
C = {C_1, ..., C_k}. Let f_{λ_C} : X → Y be a mapping that returns the label λ_i for a given
observation x ∈ C_i. Given a clustering C, we search for a labeling λ_C of the clusters
that minimizes the error, according to some error measure ℓ(Π, Γ_{f_{λ_C}}), between the
model-based label proportions Γ_{f_{λ_C}} and the known label proportions Π:

\[ \min_{\lambda_C} \; \ell(\Pi, \Gamma_{f_{\lambda_C}}) \tag{3.10} \]

The error measure could be, for instance, the average squared error ℓ_MSE or a combined
error measure such as ℓ_Π.

Let q_w be a function which is able to assess the quality of a clustering based on a
similarity measure that respects feature weights. This usually means that observations
x_i ∈ X are represented as d-dimensional feature vectors x_i = (x_i1, ..., x_id) with x_ij ∈ ℝ.
We are trying to solve the optimization problem

\[ \min_{w} \; \ell(\Pi, \Gamma_{f_{\lambda^*_{C^*}}}), \quad \lambda^*_{C^*} = \arg\min_{\lambda_{C^*}} \ell(\Pi, \Gamma_{f_{\lambda_{C^*}}}), \quad C^* = \arg\max_{C} \, q_w(C), \tag{3.11} \]

i.e. we are searching for a clustering C* which maximizes q_w and whose labeling λ*_{C*}
minimizes ℓ, over all possible weight vectors w. As formulated, with arbitrary functions
q_w and ℓ, the problem is non-convex. Since we want to allow for flexibility in the
choice of such functions, in the following we approximate solutions by an evolutionary
strategy.


3.3.6.2 The LLPC Algorithm 

The LLPC (Learning from Label Proportions by Clustering) algorithm solves
problem (3.11) by an evolutionary strategy. For each weight vector w, the sub-optimization
problem of maximizing q_w is solved by an inner clustering algorithm, where the
particular q_w depends on the algorithm. The only prerequisite for the clusterer is that it
returns disjoint clusters and respects different feature weights. The sub-optimization
problem (3.10) is independent of the clusterer and can currently be solved by different
labeling strategies, of which two are introduced in Section 3.3.6.3.

In more detail, LLPC takes a clustering algorithm clusterer, a labeling algorithm
labeler and an error measure ℓ as parameters, in addition to Π, X, B = {B_1, ..., B_h}
and Y = {Y_1, ..., Y_l}, which are related to the task of LLP, and a set of parameters
evo related to the evolutionary learning strategy. LLPC then approximates the optimal
weight vector and returns w*, as well as the related clustering C* and labels λ*_C for the
clusters. The returned weights w* can be interpreted as the importance of individual
features and thus give valuable additional information for the interpretation of cluster
models.

We use the evolutionary strategy described in [444]. The evolutionary strategy
starts with a random population P of normalized weight vectors w, i.e. w_j ∈ [0, 1].
For each individual in P, the clustering algorithm clusterer is called. The clusters are
labeled according to the given labeling algorithm labeler and the fitness is evaluated
by criterion ℓ. If the fitness is higher than the best fitness seen so far, the newly found
clustering, labeling, and weight vector are memorized as the new best ones. In each
generation, the weight values in a copy of P are mutated by a Gaussian distribution and,
with a certain probability, exchanged with P by a crossover operator. The individuals
then take part in a tournament and only the best ones are kept in the next generation.
This process is repeated until the maximum number of generations specified by the
user is reached.

Algorithm 1: The LLPC algorithm

Function LLPC(Π, X, B, Y, clusterer, k, labeler, ℓ, evo)
    best_fit := −∞; generation := 0
    Randomly initialize a population P of psize normalized weight vectors
    while generation < maxgen do
        for w ∈ P do
            C := clusterer(X, k, w)
            (λ_C, err) := labeler(C, B, Π, Y, ℓ)
            if best_fit < −err then
                best_fit := −err; C* := C; λ*_C := λ_C; w* := w
            end
        end
        generation := generation + 1
        if generation < maxgen then
            P_copy := P
            Gaussian mutation of weights in P_copy with variance mutvar
            P_children := Uniform crossover on P ∪ P_copy with probability crossprob
            P := Tournament selection with size tournsize on P ∪ P_copy ∪ P_children
        end
    end
    return C*, λ*_C, w*
end function
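The evolutionary wrapper described above might be sketched as follows. This is not the Java/RapidMiner implementation used in the experiments; it is a simplified illustration in which `clusterer` and `labeler` are caller-supplied callables. The clipping of mutated weights to [0, 1], the interpretation of tournsize as a fraction of the candidate pool, and the evaluation of mutated copies and children before selection are assumptions of this sketch.

```python
import random


def llpc(pi, X, bags, labels, clusterer, k, labeler, loss,
         maxgen=10, psize=25, mutvar=1.0, crossprob=0.3, tournsize=0.25, seed=0):
    """Evolutionary search over feature weight vectors (illustrative sketch)."""
    rng = random.Random(seed)
    d = len(X[0])
    population = [[rng.random() for _ in range(d)] for _ in range(psize)]
    best = (float("-inf"), None, None, None)  # (fitness, clustering, labeling, weights)

    def evaluate(w):
        nonlocal best
        clustering = clusterer(X, k, w)
        labeling, err = labeler(clustering, bags, pi, labels, loss)
        if -err > best[0]:
            best = (-err, clustering, labeling, list(w))
        return -err

    for generation in range(maxgen):
        fitness = [evaluate(w) for w in population]
        if generation == maxgen - 1:
            break
        # Gaussian mutation on a copy of the population, clipped to [0, 1] (assumption).
        sd = mutvar ** 0.5
        copy = [[min(1.0, max(0.0, wj + rng.gauss(0.0, sd))) for wj in w]
                for w in population]
        # Uniform crossover between two random parents from P and its mutated copy.
        pool = population + copy
        children = []
        for _ in range(psize):
            a, b = rng.sample(pool, 2)
            children.append([bj if rng.random() < crossprob else aj
                             for aj, bj in zip(a, b)])
        # Tournament selection among all candidates.
        candidates = pool + children
        cand_fit = fitness + [evaluate(w) for w in copy + children]
        size = max(1, round(tournsize * len(candidates)))
        population = []
        for _ in range(psize):
            contenders = rng.sample(range(len(candidates)), size)
            winner = max(contenders, key=lambda i: cand_fit[i])
            population.append(candidates[winner])
    return best[1], best[2], best[3]
```

Because the clustering and labeling steps are passed in as functions, the sketch mirrors the black-box treatment of the inner clusterer: any function with the signature `clusterer(X, k, w)` can be plugged in without changing the wrapper.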

Using an evolutionary strategy as a wrapper has the advantage that it is not
necessary to integrate the error measure ℓ into the optimization problem of the inner
clustering algorithm, as was done in AOC-KK. The clustering algorithm can thus be
treated as a black box and easily exchanged, without any further adaptation. It should
also be noted again that in contrast to AOC-KK, LLPC allows for classes being
represented by more than just one cluster (k > 1). Thereby LLPC allows for ever smaller
divisions of sample X, i.e. parameter k may be seen as a control parameter that trades
off bias against variance, as previously discussed.

The free choice of clustering algorithm allows for respecting different kinds of
data distributions. For example, LLPC was run successfully with k-means [426], kernel
k-means [174], EM clustering [168, 731], DBSCAN [191], PROCLUS [11] and Support Vector
Clustering (SVC) [55], without modification. Moreover, LLPC can be used with different
error measures, such as criterion (Equation 3.9) that can respect individually labeled
examples.

LLPC may therefore be looked at as a meta-algorithm for learning from label
proportions, which allows for the use of different clustering algorithms, labeling strategies and
loss functions. In a further step, one might also exchange the evolutionary algorithm.
For instance, it might be adapted to not only minimize ℓ over weight vector w, but
also over hyperparameters such as k in the case of k-means clustering, or C and the
RBF kernel parameter γ in the case of SVC.


Algorithm 2: Labeling of clusters by local search with multistarts

Function LocalSearchMultiStart(C, B, Π, k, Y, ℓ, starts)
    best := −∞
    for iteration ← 1, starts do
        λ_C, λ_bestIter ← (λ_1, ..., λ_k) with λ_i ∈ Y chosen uniformly at random
        start, bestIter ← −ℓ(Π, Γ_{f_{λ_C}})              // calculate initial fitness
        improving ← true
        while improving do
            for kpos ← 1, k do                            // at each position ...
                for lpos ← 1, |Y| do                      // ... try all labels ...
                    λ_kpos ← Y_lpos ∈ Y
                    fitness ← −ℓ(Π, Γ_{f_{λ_C}})          // calculate fitness
                    if fitness > bestIter then
                        λ_bestIter ← λ_C; bestIter ← fitness
                        break                             // leave both for loops
                    else
                        λ_kpos ← λ_bestIter,kpos          // reset to best label found at kpos so far
                    end
                end
            end
            if bestIter = start then                      // nothing better found
                improving ← false
            else
                start ← bestIter                          // continue from the improved labeling
            end
        end
        if bestIter > best then
            best ← bestIter; λ_best ← λ_bestIter          // remember best solution
        end
    end
    return λ_best, −best
End Function


3.3.6.3 Labeling Strategies
The following two labeling algorithms solve the sub-optimization problem (3.10) and
can be used as the labeler in LLPC.


Exhaustive Labeling As long as k can be restricted to a small number and l = 2
for a binary classification problem, trying all l^k possible labelings for a clustering C is
no problem. In experiments (see Section 3.3.7), good solutions were often found for
6 < k < 12. For each labeling, we need to calculate err_{λ_C}. For error measures like ℓ_MSE or
ℓ_Π, this takes linear time in the number of observations N. In the case of the aforementioned
error measures, the calculations only involve basic operations such as counting, addition,
multiplication, and division.
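A minimal sketch of the exhaustive strategy is given below. It assumes that observations are identified by integer ids, that clusters are given as sets of ids and bags as lists of ids, and it uses the average squared error as an example loss; these representations and all names are illustrative assumptions.

```python
from itertools import product


def mse(pi, gamma):
    """Average squared error between two proportion matrices (lists of rows)."""
    h, l = len(pi), len(pi[0])
    return sum((pi[u][v] - gamma[u][v]) ** 2 for u in range(h) for v in range(l)) / (h * l)


def exhaustive_labeler(clusters, bags, pi, labels, loss):
    """Try all l^k labelings of the k clusters and return the best one with its error."""
    cluster_of = {x: i for i, members in enumerate(clusters) for x in members}
    best_labeling, best_err = None, float("inf")
    for labeling in product(labels, repeat=len(clusters)):
        # Model-based proportions induced by this labeling of the clusters.
        gamma = [[sum(1 for x in bag if labeling[cluster_of[x]] == y) / len(bag)
                  for y in labels] for bag in bags]
        err = loss(pi, gamma)
        if err < best_err:
            best_labeling, best_err = labeling, err
    return best_labeling, best_err
```

For k = 12 and l = 2, the loop visits 2^12 = 4096 labelings, each evaluated in time linear in the number of observations, which is why the strategy remains practical only for small k and l.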


Local Search with Multistarts For cases where the number of clusters k > 12 or
the number of labels l > 2, a local search that is started multiple times with different
random combinations of labels is proposed. The local search greedily improves on
the current labeling of clusters by trying all possible labels at each component of the
labeling vector λ_C. Fitness measures how well the model-based label proportion matrix
Γ_f, as calculated from the current labeling, matches the given label proportions Π. If
the fitness improves, the search starts again from the first component of the labeling
vector λ_C. Otherwise, it resets the label at the current position kpos to the label of the
best (local) solution found so far. The best labeling found over all starts of the different
greedy searches is returned.

In each iteration, the greedy search runs until no further improvement is possible,
which serves as the stopping criterion. Moreover, at each step of the algorithm, the fitness
either improves or stays the same. Therefore, each search finds a local minimum. Since
the number of searches is finite, the returned labeling vector is also locally minimal.
Although, in contrast with the exhaustive labeling strategy, it cannot be guaranteed that
a globally optimal solution will be found, it has been demonstrated that the heuristic
labeling strategy performs well in real-world applications such as reducing
communication costs in distributed machine learning applications [655].
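The greedy search with restarts can be sketched as follows. To keep the example short, the sketch deviates from the interface of Algorithm 2: instead of clusters, bags and Π, it takes a caller-supplied function `error(labeling)` that is to be minimized. This interface and all names are assumptions made for illustration.

```python
import random


def local_search_multistart(k, labels, error, starts=10, seed=0):
    """Greedy local search over labelings of k clusters with random restarts."""
    rng = random.Random(seed)
    best_labeling, best_err = None, float("inf")
    for _ in range(starts):
        current = [rng.choice(labels) for _ in range(k)]
        current_err = error(current)
        improving = True
        while improving:
            improving = False
            for pos in range(k):                 # at each position ...
                for label in labels:             # ... try all other labels
                    if label == current[pos]:
                        continue
                    candidate = current[:pos] + [label] + current[pos + 1:]
                    err = error(candidate)
                    if err < current_err:
                        current, current_err = candidate, err
                        improving = True
                        break
                if improving:
                    break                        # restart from the first position
        if current_err < best_err:
            best_labeling, best_err = current, current_err
    return best_labeling, best_err
```

Each restart evaluates at most k · (l − 1) neighbors per pass, so the cost per pass grows linearly in k and l, in contrast to the l^k labelings of the exhaustive strategy. In an LLP setting, `error` would compute a loss such as ℓ_MSE between Π and the proportions induced by the labeling.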


3.3.6.4 Runtime Analysis 

The user-specified parameters maxgen, psize and tournsize in LLPC are constants.
They do not depend on the number of observations N and limit the number of
iterations of the evolutionary strategy to a constant. As discussed in Section 3.3.6.3, the
asymptotic runtime of the labeling strategies is linear in N, as k and l are constants and
the evaluation of err_{λ_C} usually takes linear time. The asymptotic runtime of LLPC
otherwise depends on the clustering algorithm used. For example, if we allow for
approximate solutions and limit the number of iteration steps, k-means has a linear runtime.
Hence, overall LLPC has linear runtime, which makes it especially well suited for use
in resource-constrained settings such as applications in the Internet of Things.
However, when used with an algorithm like kernel k-means, the runtime of LLPC can
also become quadratic, for instance.


3.3.6.5 Generating a Prediction Model 
The LLPC algorithm returns labeled clusters of sample X. It is then possible to assign
labels to individual observations x_i ∈ X with f_{λ_C}. The question is how to predict the
labels of new observations, i.e. how to transform a clustering into a prediction model.

In the case of clustering algorithms which return a model-based description of
clusters, such as k-means, which returns cluster means, one can simply use the model to
assign new observations to a cluster, and then predict the cluster’s corresponding class
label. For instance, in the case of k-means, one can assign new observations to their
closest cluster mean and predict the corresponding class label, by applying function
f_{λ_C}. Whenever a clustering algorithm is purely descriptive, i.e. in cases where it only
returns a clustering of X, but no model to assign new observations to clusters, one may
use a nearest neighbors approach such as k-NN for classification.
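For the k-means case, prediction with a cluster mean model can be sketched as below. The sketch assumes that the cluster means, their labels and a feature weight vector w are available as plain lists, and that closeness is measured by a weighted squared Euclidean distance; the function name and the example values are hypothetical.

```python
def predict(x, means, cluster_labels, w):
    """Assign x to its closest cluster mean and return that cluster's class label."""
    def dist2(a, b):
        # Weighted squared Euclidean distance; the square root is omitted because
        # it does not change which mean is closest.
        return sum((wj * aj - wj * bj) ** 2 for wj, aj, bj in zip(w, a, b))

    nearest = min(range(len(means)), key=lambda i: dist2(x, means[i]))
    return cluster_labels[nearest]
```

With means `[[0.0, 0.0], [1.0, 1.0]]` labeled "neg" and "pos", the observation `[0.9, 0.1]` is closer to neither mean when both features count equally, but with weights `[0.0, 1.0]` only the second feature matters and the observation is assigned to "neg". The model itself consists of only k mean vectors and k labels.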

In general, one option for getting a prediction model after running LLPC is to train
a standard classifier such as Naive Bayes [301] or an SVM [700], based on the currently
labeled observations. Taking this approach, LLPC may be regarded as a preprocessing
step before modeling, in which the missing labels of observations in sample X are
restored, based on the given label proportions.


3.3.7 Evaluation 


In this section, we evaluate the general method. Motivated by the steel scenario, the
method needs to be carefully evaluated so that it can be applied reliably in diverse
industrial applications. Since the method is general, the LLPC algorithm is compared
with three state-of-the-art methods for LLP: the Mean Map [514] method, Inverse
Calibration (Invcal) [552], and AOC Kernel k-Means (AOC-KK) [131]. The comparisons are
performed using standard benchmark datasets or generated test data, as is usually
done.

LLPC is written in Java and has been implemented in the form of several operators in
RapidMiner (https://rapidminer.com). All results are based on using fast k-means [187]
as an inner clustering operator, which is a variant of k-means utilizing the triangle
inequality for faster distance calculations. Observations x_i ∈ X are represented as
d-dimensional feature vectors x_i = (x_i1, ..., x_id) with x_ij ∈ ℝ. As a distance measure,
we have used the weighted Euclidean distance with weight vector w = (w_1, ..., w_d),

\[ d_w(x_i, x_{i'}) = \sqrt{\sum_{j=1}^{d} (w_j x_{ij} - w_j x_{i'j})^2} \tag{3.12} \]

which weights each feature x_ij of example x_i differently. In all experiments, we used the
exhaustive labeling strategy (see Section 3.3.6.3) with loss function ℓ_Π (see Section 3.3.5).
AOC-KK has been implemented using a combination of Java, RapidMiner, and Matlab.
For Mean Map and Invcal, R scripts were used, which were provided by the author of
Invcal [552].

Tab. 3.3: UCI datasets used for the experiments. ©[2011] Springer. Reprinted, with permission,
from [651].

Dataset       N     d   | Dataset         N     d
CREDITA       690   42  | SONAR           208   60
VOTE          435   16  | DIABETES        768   8
COLIC         368   60  | BREAST CANCER   286   38
IONOSPHERE    351   34  | HEARTC          303   22


3.3.7.1 Prediction Performance Experiments 

The accuracy of LLPC, AOC-KK, Invcal, and Mean Map has been assessed on the eight 
UCI [32] datasets shown in Table 3.3. Each possible value of a nominal feature has 
been mapped to a binary numerical feature with values 0 or 1. Numerical features were 
normalized to the [0, 1] interval. Table 3.3 shows the number of features d after this 
preprocessing step. 

In each single experiment, the accuracy has been assessed by a 10-fold
cross-validation. For LLP, we have partitioned the training set of a particular fold into bags of
size σ, by uniform sampling of observations. While such uniform sampling might not
reflect the way in which bags are formed in a real-world setting, it allows for a more
homogeneous interpretation of results across different datasets than domain-specific
sampling based on feature values. We tried several bag sizes σ: 2, 4, 8, 16, 32, 64, and 128
(with the last bag smaller than σ, if necessary). The label proportions were calculated
and the individual labels removed. In each fold, the accuracy of the learned prediction
model has then been calculated on a labeled test set.
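The bag formation step can be sketched as follows. The sketch assumes that the labels of the training set are given as a list `y` and returns bags as lists of indices; shuffling with a seeded generator stands in for the uniform sampling, and all names are hypothetical.

```python
import random


def make_bags(y, sigma, seed=0):
    """Partition the indices of y uniformly at random into bags of size sigma.

    The last bag may be smaller than sigma. Returns the bags, the label
    proportion matrix as a list of rows, and the sorted list of labels.
    """
    rng = random.Random(seed)
    idx = list(range(len(y)))
    rng.shuffle(idx)
    bags = [idx[i:i + sigma] for i in range(0, len(idx), sigma)]
    labels = sorted(set(y))
    pi = [[sum(1 for i in bag if y[i] == c) / len(bag) for c in labels]
          for bag in bags]
    return bags, pi, labels
```

After this step, only `bags` and `pi` would be handed to the learner; the individual labels in `y` are withheld and used solely for computing the proportions.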

The kernel methods Mean Map, Invcal, and AOC-KK have been tested with the
linear kernel, polynomial kernels of degree 2 and 3, and radial basis kernels (γ = 0.01,
0.1 and 1.0). LLPC has been tested for numbers of clusters k ∈ [2, 12]. As parameters for the
evolutionary strategy, we used maxgen = 10, psize = 25, mutvar = 1.0, crossprob =
0.3 and tournsize = 0.25. Running LLPC with k-means provides a prediction model
consisting of cluster means with associated class labels. The same is true for AOC-KK.
However, the cluster methods also assign labels to each observation in sample X,
allowing for a subsequent training of other classifiers, as described in Section 3.3.6.5.
Based on such labeled examples, we have trained models for Naive Bayes [301],
k-NN [13], decision trees [516], random forests [102], and the SVM [700] with linear and
radial basis kernel. The model parameters of each method have been optimized by a
grid or evolutionary search.

The combination of all datasets, bag sizes, classifiers, their variants and parameters
results in a total of 13 216 experiments: 672 for Mean Map and Invcal, 2688 for AOC-KK,
and 9856 for LLPC. For bag sizes 16, 32, 64 and 128 on the datasets COLIC and SONAR,
and for bag size 128 on CREDITA, we conducted additional experiments with LLPC for
maxgen = 5 and psize = 100. In some cases, we achieved better prediction accuracy.
All experiments took about three weeks. They were run in parallel on up to six machines
with an AMD Dual-Core or Quad-Core Opteron 2220 processor and a maximum of 4 GB
main memory.


3.3.7.2 Prediction Performance Results 

Figure 3.10 contains plots of the highest achieved accuracies for all datasets and bag 
sizes, based on the best performing models of LLPC, AOC-KK, Invcal, and Mean Map, 
over all conducted experiments. LLPC shows a higher accuracy than Invcal for many bag 
sizes on the datasets CREDITA, VOTE, COLIC, SONAR, and BREAST CANCER. On CREDITA, 
VOTE, IONOSPHERE, SONAR, and DIABETES, the variance of accuracy between bag sizes 
is smaller for LLPC compared with the other methods. Mean Map performs worse than 
LLPC and Invcal in many cases. The performance of AOC-KK varies, depending on the 
dataset. It shows good performance on BREAST CANCER and HEARTC, but not on the 
others. Except for the BREAST CANCER and VOTE datasets and a few other accuracy 
values, the overall accuracy of all methods decreases with a larger bag size. The results 
thus confirm the theory: with larger sizes of bags, without increasing the size of sample 
X, learning becomes more difficult. 

The statistical significance of results can be assessed with the adjusted version 
of the Friedman test, as proposed by [169]. The test is a non-parametric equivalent of 
ANOVA and ranks the classifiers for each dataset separately. Under the null-hypothesis, 
the average ranks of the classifiers should be equal. For comparing LLPC to all others, 
we proceeded with the two-tailed Bonferroni-Dunn test as a post-hoc test. 

Table 3.4 can be understood as a summary of the detailed plots shown in Figure 3.10,
giving a better overview of LLPC’s overall performance. The table
shows the average ranks of the compared classifiers and their difference to LLPC’s rank.
Each rank was calculated based on the best performing models (including the standard
classifiers), over all conducted experiments. The table also shows the critical difference
(CD) values for the Bonferroni-Dunn test. The CD for σ = 128 is different, because Mean
Map was not included in the comparison, due to missing values. LLPC has the highest
rank in six cases, for σ > 2. At the 10 %-level, LLPC is significantly better than AOC-KK
for σ = 8, better than Invcal for σ = 128 and better than Mean Map for σ = 4, 8, 32,
and 64. In all other cases, LLPC performs equivalently.

Fig. 3.10: Highest accuracies for all datasets and bag sizes, over all 13 216 runs of LLPC, AOC-KK,
Invcal, and Mean Map (plus the additional runs of LLPC with maxgen = 5 and psize = 100).
Some values for Mean Map and bag size 128 are missing in the plots, due to an error in the R script.
©[2011] Springer. Reprinted, with permission, from [651].

Tab. 3.4: Average ranks of classifiers by bag size, and their difference to LLPC’s rank, based on the
best models for each dataset and bag size. Positive difference values indicate a better performance
of LLPC. Highest ranks and significant differences (higher than CD) at the 10 %-level are marked in
bold. ©[2011] Springer. Reprinted, with permission, from [651].

σ             2        4        8        16       32       64       128

AVERAGE RANKS
LLPC          2.500    1.875    1.500    1.875    1.625    1.375    1.375
AOC-KK        2.000    2.750    3.000    2.875    2.625    2.375    2.000
Invcal        2.000    1.875    2.375    2.125    2.125    2.375    2.625
Mean Map      3.500    3.500    3.125    3.125    3.625    3.875    -

DIFFERENCES, CD_{<128} = 1.4317, CD_{128} = 0.98
AOC-KK       -0.500    0.875    1.500    1.000    1.000    1.000    0.625
Invcal       -0.500    0.000    0.875    0.250    0.500    1.000    1.250
Mean Map      1.000    1.625    1.625    1.250    2.000    2.500    -

Fig. 3.11: Average runtime and accuracy of 10-fold cross-validations with LLPC, Invcal, Mean Map,
and AOC-KK on several samples of random data. The data was generated for a two Gaussian mixture
classification problem (N = 10 000, d = 10, feature values normalized to [0, 1]). ©[2011] Springer.
Reprinted, with permission, from [651].
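The significance statements derived from Table 3.4 can be re-checked with a few lines of code. The sketch below copies the average ranks and CD values from the table and flags every bag size for which a method’s rank difference to LLPC exceeds the critical difference. The Invcal rank for bag size 64 is entered as 2.375, the value implied by the difference row and by the rank sums; the variable and function names are hypothetical.

```python
# Average ranks per bag size, copied from Table 3.4 (None marks the missing value).
bag_sizes = [2, 4, 8, 16, 32, 64, 128]
ranks = {
    "LLPC":     [2.500, 1.875, 1.500, 1.875, 1.625, 1.375, 1.375],
    "AOC-KK":   [2.000, 2.750, 3.000, 2.875, 2.625, 2.375, 2.000],
    "Invcal":   [2.000, 1.875, 2.375, 2.125, 2.125, 2.375, 2.625],
    "Mean Map": [3.500, 3.500, 3.125, 3.125, 3.625, 3.875, None],
}
# Critical differences of the Bonferroni-Dunn test as stated in the table.
cd = {s: (0.98 if s == 128 else 1.4317) for s in bag_sizes}


def significant(method):
    """Bag sizes at which LLPC ranks significantly better than the given method."""
    result = []
    for s, r, base in zip(bag_sizes, ranks[method], ranks["LLPC"]):
        if r is not None and r - base > cd[s]:
            result.append(s)
    return result
```

Calling `significant` for AOC-KK, Invcal and Mean Map yields the bag sizes 8, then 128, then 4, 8, 32 and 64, which agrees with the statements made in the text.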


3.3.7.3 Runtime Comparison 

For an empirical runtime comparison of the algorithms, we have generated random
data for a two Gaussian mixture classification problem (10 000 observations and 10
features, with values normalized to [0, 1]). Then, the average runtime for training and
the accuracy of the classifiers for 10-fold cross-validation have been assessed for different
samples of the data, with varying sizes (see Figure 3.11). The bag size for LLP has been
σ = 16 for all runs. A radial basis kernel with γ = 0.1 has been used for the kernel
methods. LLPC has been run with the exhaustive labeling strategy and fast k-means
(k = 6), with parameters maxgen = 3, psize = 25, mutvar = 1.0, crossprob = 0.3
and tournsize = 0.25 for the evolutionary optimization. Both LLPC and AOC-KK used
the cluster mean model for prediction.

LLPC shows a high prediction performance for all sample sizes. Moreover, LLPC has the
lowest runtime. However, since the methods are implemented in different programming
languages (Java, Matlab, R), one should not compare the absolute times, but the slope
of the curves. The curve of LLPC’s runtime is very flat and almost a straight line, while
the slopes of the other curves indicate runtimes that grow faster. The results
empirically demonstrate LLPC’s small runtime, which makes the algorithm well suited
for resource-constrained settings, especially since centroid cluster models also have a
small memory footprint.


3.3.8 Summary and Conclusions 


We have presented an approach for LLP, the Learning from Label Proportions
by Clustering (LLPC) algorithm. The approach is general enough to accommodate
the use of different clustering algorithms, labeling strategies, and loss functions. With
k-means as the inner clustering algorithm and a constant number of iterations, LLPC
has only linear worst-case training time and its cluster mean models are small and fast
to apply. In comparison with state-of-the-art methods, which need more training time,
the cluster mean models show a significantly higher or equivalent prediction accuracy
in the conducted experiments. By training other classifiers on the labeled clusters, the
highest achieved accuracy of LLPC is significantly higher for even more bag sizes. Here,
LLPC has the highest average rank for all σ > 2. In addition, LLPC combines several
beneficial properties which, to the best of our knowledge, no other approach possesses all at
once: LLPC can handle (1) non-linear decision boundaries, depending on the choice of
clustering algorithm, (2) multiple classes, (3) additionally given labeled observations,
and (4) it can weight the relevance of features.

LLP has relevance for real-world applications such as guaranteeing the privacy of
democratic free elections, or the reconstruction of labels for objects that are hard to
track, like those in smart manufacturing. Due to the small memory footprint and fast
application of centroid cluster models, as well as a linear training time, LLPC is also well
suited for running on resource-constrained devices in the Internet of Things, like edge
devices in distributed computing. It has been developed for a series of processing steps
in a steel rolling mill, allowing for early quality prediction during processing (see
Section 3.2.4.2). It has also been applied successfully in the field of traffic prediction (see
Section 4.1), where it was used to reduce communication costs in a vertically distributed
machine learning setting [655].

Parts of this contribution were previously published in conference proceedings by 
Springer [651], [654] and in the first author’s dissertation [652].

Abstract: The integration of sensor and simulation data for Machine Learning (ML)
has become a hot topic. On the one hand, machine learning in the form of Active 
Learning (AL) allows running exactly those simulations that are needed for predicting 
future events based on the analysis of explanatory variables. Since simulation is a time- 
consuming process, we save computational resources by selecting the most informative 
simulation configurations. On the other hand, simulations represent expert knowledge, 
so that joining simulation and machine learning from observations leads to better 
predictions. 


The combination of simulation and machine learning has been successfully used 
to optimize milling processes. Regarding undesirable vibrations of milling tools, a 
learning-based prediction of a stability criterion is realized. Furthermore, forces of 
milling operations are predicted using a developed data fusion of sensor and simulation 
data. Apart from forecasting process characteristics directly, machine learning also 
identifies parameter values for simulation models. In particular, the machine tool 
dynamics of a geometric physically based milling simulation system are successfully 
parametrized for different poses of the machine tool axes to reduce the number of 
calibration measurements required. 


3.4.1 Introduction 


In production engineering, many challenges arise regarding process design due to 
the high complexity and huge variety of engagement situations between the cutting 
tool and the workpiece to be machined [20]. For milling processes, different process 
parameter values can lead to different results for the machined component. Especially 
in the aerospace industry, where deep cavities have to be milled when machining 
structural components, the required long and slender milling tools can be susceptible 
to undesirable vibrations that can negatively affect the machined workpiece surface 
and can lead to increased tool wear. The tool wear can in turn influence the dynamic 
process behavior [20, 275]. 

Simulations can support the design of such processes [728]. There are different 
simulation approaches. Finite element (FE)-based methods represent complex interactions
between the cutting edges of the tool and the machined material by numerical
approximations of differential equations [676]. Since material is constantly being
removed from the workpiece, the resulting high degree of deformation requires
relatively small simulation time steps. As a consequence, the simulation runtimes are
high and only a few tooth engagements can be simulated in a reasonable time, which
makes the transfer to complex and long-running processes challenging [208]. By contrast,
using geometric physically based simulation approaches, entire processes can be
investigated [728]. This is due to the use of simple surrogate models to represent the 
tool and workpiece. However, this entails that complex interactions, such as tool wear, 
cannot be modeled reasonably. Moreover, while such methods are significantly faster 
than FE-based approaches, they are still not real-time capable. 

In the context of Industry 4.0, the vision of realizing self-optimizing machining 
systems by incorporating machine learning methods has emerged recently and started 
to attract the attention of the manufacturing community [449]. By using these methods, 
predictions of process characteristics for unseen input data can be achieved in real- 
time [128, 491]. Several investigations can be found in literature, which focus on the 
prediction of different process characteristics for milling operations using machine 
learning methods, e.g. surface roughness [337, 384, 536], process forces [498, 699], or 
chatter vibrations [482, 690]. Furthermore, over the past decade, fusing multiple sensor 
signals has been a popular approach to increase the information gain for different tasks, 
especially for tool condition monitoring [230, 717, 765]. 

The predictive accuracy of machine learning models can be further improved by 
combining sensor data with simulation results. Denkena, Dittrich, and Uhlich [170] 
trained a model using support vector regression. Simulated and measured data are 
used as features to predict shape deviations of the workpiece. The simulated data is 
calculated using measured values of the axis positions of the tool. Plakhotnik et al. 
[503] combined sensor data, such as machine axis positions and spindle torque, with 
results from computer-aided design and geometry simulations to visualize specific 
process characteristics and support process design. Peng, Bergs, Schraknepper, Klocke, 
and Döbbeler [498] utilized FE-based simulations and measurements to train a tool
wear-dependent cutting force model based on neural networks. 

In this section, different aspects concerning the combination of simulation tech- 
niques and machine learning are discussed. As shown in Figure 3.12, it is analyzed 
how simulations can be replaced by machine learning models to enable real-time 
predictions [560]. The models learned on simulation data can then be refined by a 
selected number of experimentally acquired data to close the gap between simulation 
and experiment, which may be caused by the simplifications of simulation models. In 
this context, to improve resource efficiency, experiments should only be performed for 
scenarios whose inclusion is expected to maximize the improvement of the prediction 
accuracy of the models. 

The research concerning AL involves two real-world applications. Reducing the
computational resources of an expensive FE simulation is studied in the field of tunneling
(Section 3.4.3.1). The cheaper geometric physical simulation for sampling training
data is studied when refining a model for milling (Section 3.4.3.2).

Fig. 3.12: Concept of combining simulation and machine learning [560].

Another focus of the investigation is the fusion of simulation results with sensor 
data to predict future process characteristics for milling operations [206]. To this end, 
different fusion strategies are evaluated and compared for the milling application 
(Section 3.4.4). Furthermore, the first results are obtained for the integration of machine 
learning methods into a geometric physically based simulation system in order to learn 
pose-dependent dynamic models [207]. Using ML for initializing simulation models is 
illustrated by applications in milling (Section 3.4.5.1) and grinding (Section 3.4.5.2). 


3.4.2 Simulation of NC-Milling Processes 


In milling applications, we use a geometric physically based simulation system [728]. For 
the tool and workpiece model, the Constructive Solid Geometry (CSG) [210] technique is 
used. Thereby, the tool model can be realized by modeling the envelope of the rotating 
tool by combining simple geometric primitives through Boolean operators. For modeling 
the initial workpiece, the use of a cuboid is usually appropriate. The movement of the 
tool is defined by discrete positions along an NC path, which can also be used on a real 
machine. The step size between the tool positions is the feed per tooth defined by the 
process under consideration. The workpiece geometry of the i-th feed per tooth

$$W_i = W_0 \setminus \bigcup_{j=0}^{i-1} T_j \tag{3.13}$$

is the difference between the model of the initial workpiece $W_0$ and the union of the
tool models $T_j$, $j = 0, \dots, i-1$, for all previously processed discrete tool positions along
the cutting path. The intersection between $W_i$ and $T_i$ represents the chip shape of the
i-th cutting operation. For the calculation of process forces, this chip shape is sampled 
by rays that have their origin on the rotational axis of the tool. At each ray, the equation 


$$F_i = b \cdot k_i \cdot d_0 \cdot \left( \frac{d}{d_0} \right)^{1-m_i}, \quad d_0 = 1\,\mathrm{mm}, \quad i \in \{c, n, t\} \quad \text{[321]} \tag{3.14}$$

is evaluated to calculate process forces in cutting, normal, and tangential directions.
Here, $d$ represents the chip thickness, $b$ the width of the cutting segments defined by
the rays, and $k_c$, $m_c$, $k_n$, $m_n$, $k_t$, and $m_t$ the parameters to be calibrated. Using the
directional information of the rays, the force vectors can be transformed into a global
coordinate system in x-, y- and z-directions. To simulate tool vibrations, a set of decoupled
damped harmonic oscillators represents the dynamic behavior of the machine-spindle- 
tool system. The parameter values of these oscillators have to be determined using 
measurements in advance so that deflections in x- and y-directions can be calculated 
as a vibration-induced displacement of the tool relative to the workpiece [662]. 
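
As a minimal sketch of how Equation 3.14 could be evaluated over the sampled rays, the following Python snippet sums the per-ray contributions for one force direction. It assumes the Kienzle-type form of the equation with a reference thickness of 1 mm. The function names, the list-based representation of the rays, and all numeric values are illustrative assumptions and are not taken from the simulation system described above.

```python
def ray_force(d, b, k, m, d0=1.0):
    """Force contribution of one ray: F = b * k * d0 * (d / d0) ** (1 - m).

    d: chip thickness at the ray in mm, b: width of the cutting segment in mm,
    k, m: calibrated force-model parameters, d0: reference thickness (1 mm).
    """
    if d <= 0.0:
        return 0.0  # the ray does not intersect the chip, so it adds no force
    return b * k * d0 * (d / d0) ** (1.0 - m)


def total_force(chip_thicknesses, b, k, m):
    """Sum the contributions of all rays for one direction (c, n, or t)."""
    return sum(ray_force(d, b, k, m) for d in chip_thicknesses)


# Hypothetical chip thicknesses and parameters, chosen only to exercise the code.
thicknesses_mm = [0.0, 0.02, 0.05, 0.08]
force_cutting = total_force(thicknesses_mm, b=0.1, k=1500.0, m=0.25)
```

In a full simulation, one such sum would be evaluated per direction and tool position before the force vectors are transformed into the global coordinate system.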


3.4.3 Learning from Simulation 


As mentioned before, even though geometric physically based simulations allow
for the investigation of long-running milling processes in a reasonable runtime [728], 
they are not yet real-time capable. As a result, predictions can be generated only for 
process configurations that have been simulated beforehand. By contrast, ML models 
can be evaluated in real time and, thus, offer the opportunity to predict future unknown 
events based on an analysis of past data [128] or to classify the current process state, 
using a set of features extracted from measured data [366]. Therefore, new trends 
in applied ML aim to replace simulations with ML surrogate models that are more 
appropriate for real-time applications [116, 212, 559, 560]. In this way, predictions are
not only generated in real time, but the computational resources required for running
simulations are also saved.


3.4.3.1 Active Learning for Simulation Data Acquisition 

For the simulation of processes that exhibit a high degree of visco-elasticity or
elasto-plasticity, numerical simulation methods that require high computational resources
are often necessary [116, 212, 559]. Hence, we want to reduce the amount of data that is
required to build an accurate surrogate ML model. This idea is addressed in the
literature by AL. Starting from a small and non-optimal training set, AL procedures
aim at selecting unlabeled data points whose inclusion in the training set iteratively
improves the performance of the ML model [593]. In this context, we develop an AL
approach that selects the smallest number of simulation configurations needed to learn an
accurate surrogate ML model [559].

One simulation scenario is defined by a given combination of simulation input 
parameters and considered as a data instance characterized by a set of features. Once
the simulation has been run for a given scenario, it is called a labeled data instance; otherwise,
it is called an unlabeled data instance. The active selection can also be decided based
on the label proportions of the instances that have so far been generated. In this case,
the problem is formulated as Active Class Selection (for more details see Section 5.2.2
in Volume 2). We want to reduce the cost of running process simulations for collecting
labeled training data instances. Hence, our Hybrid AL approach (HDAL) combines
error-based with distance-based methods to select the minimal number of simulation 
scenarios that are necessary to build an accurate ML model. We start by randomly 
selecting a small set of simulation scenarios (i.e. configurations) to run. Then, our 
framework operates in three stages. First, it trains the ML model on the available 
labeled scenarios. Second, it computes the training error measures of the ML model 
for each individual scenario in the labeled set. This step allows us to identify input 
data regions where the ML model is weak (i.e. uncertain about the label) and probably 
would need to see more samples from these regions for a better generalization. We 
select the labeled scenarios with the highest estimated error rates. In the third stage, we 
determine the closest unlabeled scenarios to these labeled scenarios. The assumption 
is that two close data instances probably share similar characteristics and thus similar 
estimated surrogate ML models. However, to avoid clustering problems around these
labeled scenarios (i.e. already investigated regions), we devise a “min-max” selection
procedure that chooses, among the closest unlabeled scenarios, the ones furthest away. The
entire process is iterated until a stopping criterion is met. The choice of a stopping criterion
for AL procedures is still an open research question. It can be set according to a maximum
budget of iterations or triggered when the model accuracy improvement on an independent
calibration/validation set over the last iterations becomes insignificant.
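
The selection step of such a hybrid scheme can be sketched as follows. This is one possible reading of the error-based, distance-based, and “min-max” stages described above, not the authors' implementation. The function name and the parameters `n_worst`, `n_near`, and `n_pick` are hypothetical, and Euclidean distance in the feature space is an assumption.

```python
import math


def select_scenarios(labeled, errors, unlabeled, n_worst=2, n_near=3, n_pick=1):
    """One selection step of a hybrid error- and distance-based AL sketch.

    labeled:   feature tuples of scenarios that were already simulated
    errors:    training error of the surrogate model per labeled scenario
    unlabeled: feature tuples of scenarios that were not simulated yet
    Returns the unlabeled scenarios proposed for the next simulation runs.
    """
    # Error-based stage: labeled scenarios where the model is weakest.
    worst = sorted(range(len(labeled)), key=lambda i: errors[i], reverse=True)[:n_worst]
    picked = []
    for i in worst:
        anchor = labeled[i]
        pool = [u for u in unlabeled if u not in picked]
        # Distance-based stage: the n_near unlabeled scenarios closest to the anchor.
        nearest = sorted(pool, key=lambda u: math.dist(u, anchor))[:n_near]
        # "Min-max" stage: among these neighbours, prefer the furthest ones.
        nearest.sort(key=lambda u: math.dist(u, anchor), reverse=True)
        picked.extend(nearest[:n_pick])
    return picked


# Hypothetical two-parameter scenarios (e.g. two pressure settings).
labeled = [(0.0, 0.0), (10.0, 10.0)]
errors = [0.1, 0.9]
unlabeled = [(9.0, 9.0), (8.0, 8.0), (7.0, 7.0), (1.0, 1.0)]
proposal = select_scenarios(labeled, errors, unlabeled, n_worst=1)
```

In a complete loop, the proposed scenarios would be simulated, added to the labeled set, and the surrogate model retrained before the next selection step.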

We use a 3D FE simulation designed specifically for process-oriented computational
simulations of shield tunnelling processes [116] to validate our framework. The simulation
models predict soil displacements over time at different measuring points during
tunnel excavation given two machine input parameters, namely the grouting and support
pressures. Each scenario results in a time series with 64 time steps of displacement
observations over 154 monitoring surface points. 20 scenarios are initially selected for
training, 130 scenarios are considered unlabeled, and 10 are used for testing. The goal
is to replace the simulation with an ML model to forecast future soil displacements. To
do so, we use a Vector Autoregressive model with Exogenous time series features (VARX) as a
surrogate ML model [559]. For the VARX setting, the input data consists of the set of time
series of settlements of the surface monitoring points (measured in millimeters),
plus the time series records of the grouting and the support pressures (measured in
Pa) of the machines, considered as exogenous variables. An L1 regularization is applied
to estimate the model coefficients. The VARX model is retrained with each update of
the training set at each AL iteration. Our study shows that our framework can reduce
the number of scenarios required for training by up to 60 % while
improving the prediction accuracy by up to 20 %.
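
To illustrate how an L1 penalty shapes the coefficients of an autoregressive model with exogenous inputs, the following sketch fits a strongly simplified, single-output model with one lag. It is not the VARX model of [559]: the lag order, the cyclic coordinate descent solver, the penalty weight `lam`, and the synthetic signals are all assumptions made for illustration.

```python
import math


def soft_threshold(value, lam):
    """Proximal operator of the L1 norm."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0


def fit_arx_lasso(y, exog, lam=1e-3, sweeps=300):
    """Fit y[t] = a*y[t-1] + sum_j b_j*exog[j][t-1] with an L1 penalty.

    Minimizes (1/2n)*||target - X w||^2 + lam*||w||_1 by cyclic coordinate
    descent and returns the coefficient list [a, b_1, ..., b_m].
    """
    n = len(y) - 1
    # Lag-1 design matrix: one row per time step, endogenous lag first.
    X = [[y[t - 1]] + [series[t - 1] for series in exog] for t in range(1, len(y))]
    target = y[1:]
    p = len(X[0])
    w = [0.0] * p
    residual = list(target)  # equals target - X w for w = 0
    col_sq = [sum(row[j] ** 2 for row in X) / n for j in range(p)]
    for _ in range(sweeps):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of column j with the residual that excludes w[j].
            rho = sum(X[i][j] * (residual[i] + w[j] * X[i][j]) for i in range(n)) / n
            new_w = soft_threshold(rho, lam) / col_sq[j]
            delta = new_w - w[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= delta * X[i][j]
                w[j] = new_w
    return w


# Synthetic example: u drives the output, z is an irrelevant exogenous series.
steps = 200
u = [math.sin(0.3 * t) for t in range(steps)]
z = [math.cos(0.7 * t) for t in range(steps)]
y = [0.0]
for t in range(1, steps):
    y.append(0.5 * y[t - 1] + 2.0 * u[t - 1])
coeffs = fit_arx_lasso(y, [u, z])
```

On this noise-free example the fitted coefficients should end up close to the generating values, with the coefficient of the irrelevant series shrunk towards zero. In the multi-output case, one such regression would be set up per monitored point.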


3.4.3.2 Active Learning for Model Building 

Active learning can be applied not only to reduce data annotation costs (i.e. reducing 
the number of simulation configurations to run) but also as an informed sampling 
procedure to accurately learn the ML surrogate model. The surrogate model should be
able to represent the capabilities of the simulation correctly and be applied afterward in an
online setting to predict process characteristics. Since the quality of an ML model depends
largely on the training data, we exploit the interaction between the surrogate ML model
and the simulation using AL to actively and iteratively sample from the simulation data.

In this context, in order to control milling processes in real time by adapting the
process parameter values, we develop a novel ML framework based on a geometric
physically based simulation of the NC-milling process (cf. Section 3.4.2) [560]. We
focus specifically on building an ML model to predict the Poincaré diameter,
which is considered as a process stability criterion for a given simulation scenario,
characterized by a given input spindle speed. Parts of this section have already been published
by the authors [206, 560].

Our experimental use case is a face milling process using a fixed width of cut of
2 mm, an increasing depth of cut from 0 mm to 1 mm, a feed per tooth of 0.1 mm, and
varying values for the spindle speed between 3000 min⁻¹ and 15 000 min⁻¹. For the
tool, a torus cutting tool with a diameter of 6 mm and a corner radius of 1 mm is used
to machine aluminum alloy 7075.

The proposed framework consists of a weighted ensemble of Multilayer Perceptrons
(MLPs) trained on different subsets of features. In addition, an AL procedure is
simultaneously used to iteratively design the optimal training set from the generated
simulation data. A forecast for the Poincaré diameter is delivered every $P_1$ milliseconds
for the next $P_2$ milliseconds. This is achieved by training a committee of
MLPs, each on a distinct subset of process features [129]. Each member of the
committee is an online regression model, where the expected Poincaré diameter is
assumed to be a function of the historical values of the time series features, namely the
forces in the three-dimensional global coordinate system $F_x$, $F_y$, and $F_z$, the deflections
in the two-dimensional space $D_x$ and $D_y$, and the expected values of the chip volume.
Since many configurations (i.e. various spindle speed values) are simulated, an AL
approach based on the Query By Committee (QBC) paradigm [129] is used to iteratively
add simulation configurations whose inclusion in the training set should improve the
prediction accuracy. The final prediction output is computed by a weighted average
over the committee members’ outputs. The main steps of the framework are illustrated
in Figure 3.13.

Fig. 3.13: Data acquisition and prediction framework [560].


Pool-Based Active Learning with QBC The output of the simulation depends to a 
large extent on the input value for the spindle speed. The goal is to identify simulation 
configurations whose inclusion in the training set improves the accuracy of the surro- 
gate ML model. One way to do this is to carefully design the training set by controlling 
the selection of training simulation configurations using AL. This control is given by a 
problem-dependent heuristic, e.g., the decrease of the estimated prediction error on a 
test set if a given configuration or a pool of configurations is added to the training set.
One of the most popular AL approaches is built based on the QBC paradigm [593]. 

A committee of learners is built following different assumptions on subspaces 
of instances, which may easily lead to a huge number of hypotheses to cover and 
quickly become computationally intractable for real applications [442]. It can also be 
built on disjoint subsets of features [129], usually generated by the Random Subspace 
Generation (RSG) approach [129], which may also lead to an intractable number of 
hypotheses. In our approach, each feature subspace represents one characteristic of 
the original process (cf. Figure 3.13) [560]. The QBC procedure selects the simulation
configuration on which the committee members’ predictions are maximally split. First,
each committee member is trained on an initial small set of simulation configurations
$N_i$. Then, at each iteration, the set of $N_c$ candidate configurations is sorted according
to a disagreement measure for each simulation configuration $cs$:

$$f_{cs} = \log\left( \sum_{j=1}^{q} \sum_{t=1}^{n} \left( \hat{y}_{j,t}^{cs} - \hat{y}_{t}^{cs} \right)^2 \right), \tag{3.15}$$

where $\hat{y}_{j,t}^{cs}$ is the value predicted by the committee member $j$ at the instant $t$, and $\hat{y}_{t}^{cs}$ is
the estimated result of the committee composed of $q$ models on the scenario $cs$,
obtained as $\hat{y}_{t}^{cs} = \frac{1}{q} \sum_{j=1}^{q} \hat{y}_{j,t}^{cs}$. From the sorted configurations, the first $N_s$ are
selected to be added to the training set. This entire process is iterated until a stopping 
criterion is met, e.g. a maximum number of configurations in the training set is reached, 
or the accuracy improvement on an independent calibration/validation set over the 
last iterations becomes insignificant. To output the final predictions, we use the already 
trained committee of MLPs and aggregate them in a weighted ensemble model. The 
weights are computed offline using a normalized version of the loss of the model on 
the training set. During the AL procedure, the weights are updated with each update 
of the training set. This mechanism may help to achieve a blind adaptation to drifting 
characteristics in time series observations by adjusting the contribution of each model 
in the final output [200]. Let $M = \{M_1, M_2, \dots, M_q\}$ be the committee of $q$ MLPs and
let $\hat{y}_{j,t}^{cs}$ be the output of model $M_j$ for a given simulation configuration $cs$ at a time instant $t$.
The final prediction output is obtained as a weighted average of these outputs, with the
weight of each model derived from its normalized error

$$\chi_j \in (0, 1], \quad \forall j, \tag{3.16}$$

where $\chi_j$ is the error of model $M_j$ on the most recently obtained training set. This error
is calculated as the Normalized Root Mean Squared Error (NRMSE) between the time series
$Y^{cs} = \{y_1^{cs}, \dots, y_t^{cs}, \dots, y_n^{cs}\}$ of the Poincaré diameter for the
simulation scenario $cs$ and the predicted Poincaré diameter $\hat{y}_t^{cs}$ at each instant $t$.
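
The QBC ranking step can be sketched in a few lines. The snippet assumes that the disagreement of a configuration is the logarithm of the summed squared deviations of the member forecasts from the committee mean. The dictionary layout, the function names, and the example forecasts are hypothetical.

```python
import math


def committee_mean(predictions):
    """Average the q member forecasts at every time step."""
    q = len(predictions)
    return [sum(member[t] for member in predictions) / q
            for t in range(len(predictions[0]))]


def disagreement(predictions):
    """Log of the summed squared deviation from the committee mean.

    predictions: list of q lists, each holding n forecasts for one configuration.
    """
    mean = committee_mean(predictions)
    spread = sum((member[t] - mean[t]) ** 2
                 for member in predictions for t in range(len(mean)))
    return math.log(spread) if spread > 0.0 else float("-inf")


def select_by_committee(candidates, n_select):
    """Return the n_select candidate keys with the largest disagreement."""
    ranked = sorted(candidates, key=lambda cs: disagreement(candidates[cs]), reverse=True)
    return ranked[:n_select]


# Hypothetical forecasts of two committee members for three spindle speeds.
candidates = {
    3000: [[1.0, 1.0], [1.1, 0.9]],
    3050: [[1.0, 1.0], [2.0, 0.0]],
    3100: [[1.0, 1.0], [1.0, 1.0]],
}
chosen = select_by_committee(candidates, n_select=2)
```

Configurations on which all members agree receive the lowest score and are therefore added last, which matches the intent of querying where the committee is maximally split.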


Evaluation and Results The simulation generates one observation for each feature
with a frequency of 20 kHz, resulting in a step size of 0.05 ms [560]. The simulation
configurations are generated for 240 different spindle speed values in the range of
3000 min⁻¹ to 15 000 min⁻¹ with a step size of 50 min⁻¹. The set of simulation configurations
is randomly split into three independent sets: 180 are used for training,
20 serve as a validation set, and the remaining 40 are kept for testing.
An aggregation period of $P_1$ = 10 ms is used for the preprocessing and a period of
$P_2$ = 50 ms is set as a forecasting horizon. The MLP model parameters are tuned using a
grid-search procedure on the validation configurations. For the AL procedure, 50 of
the 180 simulation configurations of the training set are randomly selected to construct
the initial training set. From the remaining 130 configurations, 15 are added at each
iteration. A maximum number of 5 iterations is used as a stopping criterion. The subset
of features is constructed using two lagged values of each of the five characteristic
features $F_x$, $F_y$, $F_z$, $D_x$, and $D_y$. The corresponding spindle speed value, the time index,
and two future values of the chip volume are added to enrich each subset of features.

Fig. 3.14: (a) Active learning performance. (b) Comparison between test and predicted data. (c)
Depiction of Poincaré predictions in critical areas for three different spindle speeds [560].
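
The relation between sampling rate, aggregation period, and forecasting horizon, as well as the construction of lagged features, can be made concrete with a short sketch. Averaging over non-overlapping windows is an assumption here, since the text does not state which aggregation function is used, and the helper names are hypothetical.

```python
def aggregate(signal, window):
    """Average consecutive, non-overlapping windows of raw samples (assumed aggregation)."""
    return [sum(signal[i:i + window]) / window
            for i in range(0, len(signal) - window + 1, window)]


def lagged_row(series, t, n_lags=2):
    """Return [x[t-1], ..., x[t-n_lags]] for every feature series, in order."""
    row = []
    for values in series:
        row.extend(values[t - lag] for lag in range(1, n_lags + 1))
    return row


SAMPLING_HZ = 20_000      # simulation output rate stated in the text
P1_MS, P2_MS = 10, 50     # aggregation period and forecasting horizon

samples_per_window = SAMPLING_HZ * P1_MS // 1000   # raw samples per aggregated value
horizon_steps = P2_MS // P1_MS                     # aggregated steps to forecast ahead
```

With the stated settings, each aggregated value summarizes 200 raw samples and the forecast reaches five aggregated steps ahead. Two lags of five features give ten lagged inputs per row before the spindle speed, the time index, and the chip volume values are appended.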

The results are presented from three perspectives [560]: 1) Figure 3.14(a) shows a 
comparison between the AL approach, a random sampling, and the performance of the 
ensemble, which is trained using the whole training set; 2) Figure 3.14(b) illustrates a 
comparison between the predicted Poincaré diameter values and the computed values 
on a subset of testing scenarios; 3) Figure 3.14(c) presents a comparison between the 
test and the predicted data for three different spindle speeds. 

Figure 3.14(a) illustrates the performance of the AL method. The AL approach 
reaches almost the same performance as does training over the whole training set, 
while using only 70 % of the training set. The results show the advantage of designing an
optimal training set especially when a large pool of simulation scenarios with randomly
chosen input parameters is available.

Figure 3.14(b) presents a comparison between the true and the forecasted Poincaré 
diameter values and illustrates that the ensemble is capable of accurately forecasting 
the Poincaré diameter in distinct scenarios and time periods. 

Figure 3.14(c) shows that our framework predicts process instability on time or
even earlier for three different spindle speeds. This highlights its usefulness for real-time
applications, where process parameter values should be monitored and adjusted to 
avoid process instabilities. 


3.4.4 Fusion Between Simulation and Sensor Data 


In general, the integration of diverse data sources provides useful and enriched new 
data. We refer to this combination as a data fusion process [502]. However, it presents 
many challenges [468]. Fusing different sources can generate conflicts. The conflicts are 
most often the result of incomplete, erroneous, and out-of-date records [468]. Another 
challenge comes along with complex sequences that are multi-dimensional, multi- 
modal and time-varying [331]. In addition, the process of data fusion may also lead to 
larger amounts of data, which poses difficulties for online application [331]. Solving 
these challenges requires not only substantial efforts and domain knowledge but also 
scalable and principled fusion approaches that cope with real-time constraints. 

Process simulation can be viewed as background knowledge for domain experts. 
We want to integrate this knowledge into the machine learning process and, at the same 
time, use the simulation as an additional data source (i.e. generation of additional data 
points/data features or annotation of existing data). In this context, we aim at fusing
both simulation and sensor data to predict active and passive forces in a real-time slot
milling process [22, 206]. In [22], we present a framework that allows combining real-world
observations collected from sensors and simulations at two levels: the data level or the
model level. At the data level, observations and synthetic data are integrated to form an
enriched dataset for learning. At the model level, the models learned individually from
observed and simulated data are integrated using an ensemble technique. By establishing
a trade-off between model bias and variance, we perform an automatic selection of the
appropriate fusion level. Figure 3.15 shows a schematic illustration of the
approach. Parts of this section have already been published by the authors [22, 206].

Fig. 3.15: Framework for synchronizing and fusing simulation and sensor data [206].

To validate the developed framework, slot milling processes are conducted using a
width of cut and depth of cut of 2 mm and a tilt angle of 30° [206]. A ball-end mill with a
diameter of 10 mm and two cutting edges is used to machine AISI M3:2, hardened to
approx. 62 HRC. During these milling experiments, process forces in x-, y- and z-directions
are measured using a triaxial force dynamometer (Kistler 9257B) with a sampling
frequency of 20 kHz. The goal of the considered use case is to predict upcoming active
and passive forces, which are affected by tool wear. To this end, ten different process
parameter configurations with varying values for the cutting speed and feed per tooth
are used. For each parameter configuration, 180 slots are milled in order to provoke
tool wear. After every twentieth machined slot, the width of flank wear is measured for
both cutting edges and averaged to obtain a value $VB_B$, which indicates the wear state
of the tool [206]. Intermediate values of $VB_B$ for all other slots $i$ are interpolated by


$$VB_B^{i} = \begin{cases} VB_B^{i-1} + \dfrac{F_\Sigma^{i} \cdot \left( VB_B^{u} - VB_B^{l} \right)}{\sum_{j=1}^{k} F_\Sigma^{j}}, & \text{if } F_\Sigma^{i} > F_\varepsilon \\ VB_B^{i-1}, & \text{otherwise} \end{cases} \tag{3.18}$$

$$F_\Sigma^{i} = |F_x^{i}| + |F_y^{i}| + |F_z^{i}|, \tag{3.19}$$

where $i \neq 0$ and $F_\varepsilon$ is a threshold that has to be defined in order to distinguish between
signal and noise. The values $VB_B^{l}$ and $VB_B^{u}$ are obtained from the measurements between
which the interpolation is performed, and $k$ is the number of interpolated values between
these measured values. Using this approach, high force amplitudes are assumed to
induce a high load on the cutting edges, resulting in increased values for $VB_B$ [206].


3.4.4.1 Simulation-Sensor Data Mismatch Evaluation 
The fusion of data collected from both sources into a single data representation is not 
straightforward and data mismatch between both sources needs to be checked. Data 
mismatch may mean that different data sources attribute different values to the same 
instance. This can be addressed by ensuring the completeness and the correctness of 
each data source [468]. 

The mismatch is often caused by different data alignments due to different data 
sampling frequencies from simulations and measurements. This results in a time delay 
between simulated and measured data. Therefore, data synchronization is required.
In addition, the mismatch can be caused by learning from different underlying distributions.
In fact, due to the simplified models in the simulation system and the neglect
of complex engagement behaviors, e.g. frictional effects, which are used to ensure a
reasonable runtime, non-negligible deviations between simulation data and measurements
of the corresponding process characteristics can occur, especially for process
forces or tool vibrations. This deviation can differ for different process parameter values.
As a result, a calibration of the simulation is required. Due to measurement noise and
uncertainties, the quantification of simulation accuracy is a challenging task.


Synchronization If both simulated and sensor data are acquired using the same
sampling frequency, only a constant time shift between each time series has to be
determined. This can be achieved, e.g., manually, using change points estimated by
auto-regressive approaches [740], or by analyzing the continuous wavelet transform of
the time series. In the context of the latter approach, the transformed signal is given by:

$$W(a, b) = \frac{1}{|a|^{1/2}} \int_{-\infty}^{\infty} Z(t)\, \bar{\psi}\left( \frac{t - b}{a} \right) dt, \tag{3.20}$$

where $Z(t)$ is the original signal, $\bar{\psi}\left( \frac{t-b}{a} \right)$ represents the complex conjugate of a scaled
and translated mother wavelet $\psi(t)$, and $a$ and $b$ are the scale and translation parameters,
respectively. Each scale corresponds to a frequency, resulting in information about
the correlation between a given signal at a certain time instant and an investigated 
frequency without the need to make a trade-off between time and frequency resolution, 
which is a crucial issue of spectral analysis. In the milling application, the time-related 
delay between the two investigated time series can be identified by the points in time 
where the intensity of the wavelet transform at the tooth engagement frequency is 
greater than zero. 
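
A discretized version of Equation 3.20 at a single scale is enough to sketch the delay estimation. The snippet below assumes a Morlet mother wavelet with center frequency `omega0`, truncates its Gaussian envelope, and replaces the "greater than zero" criterion by a relative threshold so that numerical leakage is ignored. These choices, the function names, and the synthetic burst signals are illustrative assumptions.

```python
import cmath
import math


def morlet_intensity(signal, fs, freq, omega0=6.0, support=4.0):
    """|W(a, b)| of a Morlet wavelet transform at one frequency for every sample b."""
    a = omega0 / (2.0 * math.pi * freq)      # scale that matches the frequency
    half = int(support * a * fs)             # truncate the Gaussian envelope
    # Complex conjugate of the scaled mother wavelet, sampled at offsets k / fs.
    kernel = []
    for k in range(-half, half + 1):
        u = k / (fs * a)
        kernel.append(cmath.exp(-1j * omega0 * u) * math.exp(-0.5 * u * u))
    norm = 1.0 / (math.sqrt(a) * fs)         # |a|^(-1/2) times dt of the integral
    n = len(signal)
    out = []
    for b in range(n):
        acc = 0j
        for idx, k in enumerate(range(-half, half + 1)):
            t = b + k
            if 0 <= t < n:
                acc += signal[t] * kernel[idx]
        out.append(abs(acc) * norm)
    return out


def onset_index(intensity, rel_threshold=0.5):
    """First sample whose intensity exceeds a fraction of the maximum."""
    limit = rel_threshold * max(intensity)
    return next(i for i, value in enumerate(intensity) if value > limit)


def estimate_delay(sim, meas, fs, freq):
    """Delay of `meas` relative to `sim` in seconds."""
    shift = (onset_index(morlet_intensity(meas, fs, freq))
             - onset_index(morlet_intensity(sim, fs, freq)))
    return shift / fs


def burst(start, fs, freq, n=600):
    """Synthetic signal: zero before `start`, then a sinusoid at `freq`."""
    return [0.0 if k < start else math.sin(2.0 * math.pi * freq * (k - start) / fs)
            for k in range(n)]


fs, tooth_freq = 1000.0, 50.0
delay = estimate_delay(burst(200, fs, tooth_freq), burst(230, fs, tooth_freq),
                       fs, tooth_freq)
```

The two synthetic signals differ by a shift of 30 samples, so the estimated delay should be close to 0.03 s. For real force signals, the frequency would be set to the tooth engagement frequency of the process.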


Calibration For the considered geometric physically based simulation system, simu- 
lation models can be calibrated using measurements as ground truth. This has to be 
performed for each combination of the tool geometry and the workpiece material of the 
regarded process. For simulated forces, for example, the parameters of the force model 
p can be determined by applying an optimization procedure to minimize the squared 
Euclidean distance between simulated and measured forces 

n 

Up) = X> (Fsen(ti) - Fsim(, t)))” (3.21) 

i=1 
acquired using the same process parameter values for both the machining process
and the corresponding simulation run. To this end, any optimization algorithm
could be applied to solve the minimization task. However, in practice, quasi-Newton
approaches often outperform other methods. Using the Broyden-Fletcher-Goldfarb-Shanno
(BFGS) [113] optimization algorithm, for example, an approximation $H_k$ of the
inverse Hessian matrix is estimated, which is updated at each iteration $k$ of the procedure.
According to Newton’s method, the parameter values of the next iteration

$$ p_{k+1} = p_k + \alpha_k s_k \tag{3.22} $$

are given by the line search along the descent direction

$$ s_k = -H_k \nabla L(p_k) \tag{3.23} $$

by estimating $\alpha_k$ through

$$ \alpha_k = \arg\min_{\alpha} L(p_k + \alpha s_k). \tag{3.24} $$

The update of $H_k$ is performed by adding a rank-two correction

$$ H_{k+1} = H_k + a\,u u^T + b\,v v^T, \tag{3.25} $$

where

$$ u = \delta_k = p_{k+1} - p_k, \quad v = H_k y_k, \quad y_k = \nabla L(p_{k+1}) - \nabla L(p_k) \tag{3.26} $$

are typically chosen, so that the quasi-Newton condition

$$ H_{k+1} y_k = H_k y_k + a\,u u^T y_k + b\,v v^T y_k = \delta_k \tag{3.27} $$

is satisfied, resulting in

$$ H_{k+1} = H_k + \frac{\delta_k \delta_k^T}{\delta_k^T y_k} - \frac{H_k y_k y_k^T H_k}{y_k^T H_k y_k}. \tag{3.28} $$
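
The iteration in Equations 3.22 to 3.28 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the calibration code of the simulation system: the force model `simulate_force`, its two parameters, and the inputs `hs` (standing in for chip thickness values) are hypothetical. Because this toy model is linear in p, the loss is quadratic along any search direction, so the line search of Equation 3.24 can be solved exactly by fitting a parabola through three loss values. A general force model would need a numerical line search instead.

```python
def simulate_force(p, h):
    """Hypothetical force model, linear in the parameters p = (p0, p1)."""
    return p[0] * h + p[1]


def loss(p, hs, f_meas):
    """Squared Euclidean distance between measured and simulated forces (Eq. 3.21)."""
    return sum((f - simulate_force(p, h)) ** 2 for h, f in zip(hs, f_meas))


def gradient(p, hs, f_meas):
    """Analytic gradient of the loss for the linear toy model."""
    res = [simulate_force(p, h) - f for h, f in zip(hs, f_meas)]
    return [2.0 * sum(r * h for r, h in zip(res, hs)), 2.0 * sum(res)]


def calibrate(hs, f_meas, p_start, max_iter=20, tol=1e-16):
    p = list(p_start)
    H = [[1.0, 0.0], [0.0, 1.0]]  # initial inverse-Hessian approximation
    g = gradient(p, hs, f_meas)
    for _ in range(max_iter):
        if g[0] ** 2 + g[1] ** 2 < tol:
            break
        # Descent direction, Eq. 3.23
        s = [-(H[0][0] * g[0] + H[0][1] * g[1]),
             -(H[1][0] * g[0] + H[1][1] * g[1])]
        # Line search, Eq. 3.24: exact for a loss that is quadratic along s
        l0, l1, l2 = (loss([p[0] + a * s[0], p[1] + a * s[1]], hs, f_meas)
                      for a in (0.0, 1.0, 2.0))
        c2 = 0.5 * (l2 - 2.0 * l1 + l0)
        c1 = l1 - l0 - c2
        if c2 <= 0.0:
            break
        alpha = -c1 / (2.0 * c2)
        p_new = [p[0] + alpha * s[0], p[1] + alpha * s[1]]  # Eq. 3.22
        g_new = gradient(p_new, hs, f_meas)
        delta = [p_new[0] - p[0], p_new[1] - p[1]]
        y = [g_new[0] - g[0], g_new[1] - g[1]]
        Hy = [H[0][0] * y[0] + H[0][1] * y[1], H[1][0] * y[0] + H[1][1] * y[1]]
        dy = delta[0] * y[0] + delta[1] * y[1]
        yHy = y[0] * Hy[0] + y[1] * Hy[1]
        if dy > 1e-14 and yHy > 1e-14:
            # Rank-two correction, Eq. 3.28 (H is symmetric, so H y y^T H = Hy Hy^T)
            H = [[H[i][j] + delta[i] * delta[j] / dy - Hy[i] * Hy[j] / yHy
                  for j in range(2)] for i in range(2)]
        p, g = p_new, g_new
    return p


# Synthetic "measurements" generated from known parameters (3.0, 0.5)
hs = [0.5, 1.0, 1.5, 2.0]
f_meas = [simulate_force([3.0, 0.5], h) for h in hs]
p_hat = calibrate(hs, f_meas, [1.0, 0.0])
```

In practice, a library optimizer would normally replace the hand-written loop; the sketch only makes the roles of $\delta_k$, $y_k$, and $H_k$ explicit.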


3.4.4.2 Automatic Fusion-Level Selection 
A sophisticated ML model should establish a trade-off between bias and variance. This
principle offers guidance on how to automate the fusion-level selection. Given a learned
model $\hat{f}$ that approximates an unknown true model $f$, the expected mean-squared
error between the target variable $y = f(x)$ and the model predictions on an unseen
sample $x$ can be decomposed into bias, variance, and an irreducible error term [334].
One way to reduce the variance-type error is to use an ensemble model [101] that
combines many models into a single model using an averaging technique [556]. This
follows from the decomposition of the ensemble error into the average bias of the single
ensemble models, a variance term, and a covariance term [694]. Brown, Wyatt, and
Tino [106] have proven that when using an average-based ensemble model $\bar{f}$ with
equal weights (i.e. $\bar{f} = \sum_{i=1}^{N} w_i \hat{f}_i$ with $w_i = 1/N$, where
$\hat{f}_i$, $i \in \{1, \dots, N\}$, are the single models), the expected error
decomposition is given by:

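$$ E\left[(y - \bar{f}(x))^2\right] = \overline{\mathrm{Bias}}^2 + \frac{1}{N}\overline{\mathrm{Var}} + \left(1 - \frac{1}{N}\right)\overline{\mathrm{Covar}}, \tag{3.29} $$

where

$$ \overline{\mathrm{Var}} = \frac{1}{N}\sum_{i=1}^{N} E\left[\left(\hat{f}_i(x) - E[\hat{f}_i(x)]\right)^2\right], \tag{3.30} $$

$$ \overline{\mathrm{Bias}} = \frac{1}{N}\sum_{i=1}^{N} \left(E[\hat{f}_i(x)] - E[f(x)]\right), \tag{3.31} $$

$$ \overline{\mathrm{Covar}} = \frac{1}{N(N-1)}\sum_{i=1}^{N}\sum_{j=1, j \neq i}^{N} E\left[\left(\hat{f}_i(x) - E[\hat{f}_i(x)]\right)\left(\hat{f}_j(x) - E[\hat{f}_j(x)]\right)\right]. \tag{3.32} $$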


The variance term in Equation 3.29 is the average variance divided by the number of base
models N. When N is large enough, this term vanishes.
However, Equation 3.29 also shows that the averaged bias and covariance have to be
taken into account when adding more and more models. In our setting, we are concerned
with a small number of base models (mainly two, one built on sensor data and the
other on simulation data). In addition, the decision on the transition from a single
model to an ensemble has to be made. Therefore, it is more straightforward to treat
the whole term $\frac{1}{N}\overline{\mathrm{Var}} + (1 - \frac{1}{N})\overline{\mathrm{Covar}}$ as the variance-type error of the ensemble model
and the average bias of the single models, $\overline{\mathrm{Bias}}$, as the corresponding bias, since


$$ \mathrm{Bias}(\bar{f}) = E[\bar{f}] - f = E\left[\frac{1}{N}\sum_{i=1}^{N} \hat{f}_i\right] - f = \overline{\mathrm{Bias}}. \tag{3.33} $$
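
The decomposition in Equation 3.29 can be checked numerically. The following sketch is only an illustration: the two "models" are hypothetical random predictors for one fixed input x with a fixed target, and expectations are replaced by averages over repeated draws. With these sample moments, the identity holds up to floating-point error.

```python
import random


def decomposition_terms(preds, target):
    """preds[r][i] is the prediction of model i in repetition r for one fixed input."""
    reps, n_models = len(preds), len(preds[0])
    mean = [sum(row[i] for row in preds) / reps for i in range(n_models)]

    def cov(i, j):
        return sum((row[i] - mean[i]) * (row[j] - mean[j]) for row in preds) / reps

    bias = sum(mean[i] - target for i in range(n_models)) / n_models          # Eq. 3.31
    var = sum(cov(i, i) for i in range(n_models)) / n_models                  # Eq. 3.30
    covar = sum(cov(i, j) for i in range(n_models) for j in range(n_models)
                if i != j) / (n_models * (n_models - 1))                      # Eq. 3.32
    # Mean-squared error of the equally weighted ensemble
    mse = sum((sum(row) / n_models - target) ** 2 for row in preds) / reps
    return mse, bias, var, covar


rng = random.Random(0)
target = 1.0
preds = []
for _ in range(2000):
    shared = rng.gauss(0.0, 0.3)  # common noise source, induces positive covariance
    preds.append([target + 0.2 + shared + rng.gauss(0.0, 0.5),
                  target - 0.1 + shared + rng.gauss(0.0, 0.4)])

mse, bias, var, covar = decomposition_terms(preds, target)
n = 2
rhs = bias ** 2 + var / n + (1.0 - 1.0 / n) * covar  # right-hand side of Eq. 3.29
```

Lowering the shared noise component in this toy setup lowers `covar` and therefore the ensemble error, which is the effect exploited by the selection rule below.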


The decision for the transition from a data-based fusion to a model-based fusion should 
be based on the level of the expected variance-type error together with the expected bias 
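of the data-based fusion model. Let $\hat{f}_{\mathrm{simulation}}$ and $\hat{f}_{\mathrm{sensors}}$ be two models trained on
simulation and sensor data, respectively, and let $\hat{f}_{\mathrm{fus}}$ be a model trained using a
data-based fusion approach. From the decomposition in Equation 3.29, we can conclude
that the model-based fusion using an averaged ensemble with equal weights contributes
to reducing the variance-type error compared to the data-based fusion model if

$$ \mathrm{Covar}(\hat{f}_{\mathrm{simulation}}, \hat{f}_{\mathrm{sensors}}) < \tau, \tag{3.34} $$

where the threshold $\tau = 2\left(\mathrm{Var}(\hat{f}_{\mathrm{fus}}) - \frac{\mathrm{Var}(\hat{f}_{\mathrm{simulation}}) + \mathrm{Var}(\hat{f}_{\mathrm{sensors}})}{4}\right)$.
This follows from Equation 3.29 with N = 2, for which the variance-type error of the
ensemble is $\frac{1}{4}\left(\mathrm{Var}(\hat{f}_{\mathrm{simulation}}) + \mathrm{Var}(\hat{f}_{\mathrm{sensors}})\right) + \frac{1}{2}\mathrm{Covar}(\hat{f}_{\mathrm{simulation}}, \hat{f}_{\mathrm{sensors}})$.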

Equation 3.34 complies with the decomposition in Equation 3.29, stating that a 
lower covariance is always desired to reduce the overall ensemble error. It also confirms 
that enforcing a degree of diversity between the ensemble members through low covari- 
ance is favorable as it reduces the ensemble ambiguity, presented in the more general 
ensemble error decomposition schema [355]. Furthermore, once the single models are 
trained and built, the average covariance term can be estimated entirely without any 
knowledge of the true data labels or the real function f to be approximated. From 
a practical point of view, this result confirms the usefulness of using simulation for 
enriching data with samples that reflect different patterns than the ones observed with
sensor data. This enrichment includes either new features that cannot be measured by
sensors or observations that cannot be detected with sensors. 

With equal weights, the bias of the ensemble model equals the average bias of the single
models. It is therefore lower than the bias of the single model with
the highest bias. If the covariance between the single models is lower than the threshold
τ in Equation 3.34, the system compares their average bias with the bias of the
data-based fusion model to check whether the reduced variance-type error combined with
the average bias yields a better bias-variance trade-off. If it does not, the system keeps
the data-level fusion.
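
The selection procedure described above can be sketched as follows. This is a hedged illustration, not the authors' implementation: the function names are hypothetical, the threshold is computed from Equation 3.29 with N = 2, and the final comparison of squared bias plus variance-type error is one possible way to make the bias-variance check concrete, which the text does not specify in detail.

```python
def covariance_threshold(var_fus, var_sim, var_sens):
    """Threshold for the covariance of the two single models (cf. Eq. 3.34, N = 2)."""
    return 2.0 * (var_fus - (var_sim + var_sens) / 4.0)


def select_fusion_level(var_sim, var_sens, covar, var_fus, bias_avg, bias_fus):
    """Return 'model-level' (ensemble) or 'data-level' (fused feature set)."""
    tau = covariance_threshold(var_fus, var_sim, var_sens)
    if covar >= tau:
        # The ensemble would not reduce the variance-type error
        return "data-level"
    # Variance-type error of the equally weighted two-model ensemble
    ens_var = (var_sim + var_sens) / 4.0 + covar / 2.0
    # Assumption: compare squared bias plus variance-type error of both options
    if bias_avg ** 2 + ens_var < bias_fus ** 2 + var_fus:
        return "model-level"
    return "data-level"


# Hypothetical numbers for illustration only
choice_a = select_fusion_level(1.0, 1.0, 0.1, 1.0, bias_avg=0.1, bias_fus=0.5)
choice_b = select_fusion_level(1.0, 1.0, 0.1, 1.0, bias_avg=2.0, bias_fus=0.5)
choice_c = select_fusion_level(1.0, 1.0, 1.5, 1.0, bias_avg=0.1, bias_fus=0.5)
```

In the second call, the covariance test passes but the higher average bias outweighs the variance reduction, so the data-level fusion is kept.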

The systematic fusion-level selection is validated by the previously described slot 
milling process [206]. We measure the process forces (Fa and Fp) using sensors, aggregate
them using an aggregation period of 0.1 ms, and predict the expected future forces
every 0.1 ms for the next 10 ms. Monitoring the milling process forces enables control of
both process stability and quality [560]. Ten different process scenarios are investigated
by varying the input parameters, namely speed and feed. The resulting length of the
time series for each scenario depends on the values of the input parameters and varies
from 13250 to 54500 observations. The simulation is used for feature enrichment by 
generating features that typically cannot be measured during the process. For this 
purpose, the chip volume, the sum of time of engagement, the feed, and the mean of 
the cutting speeds are generated for each point in time to potentially enrich the feature 
space of the force measurements [206]. 

After solving possible mismatches between the sources, a unified feature set is 
created by joining new features generated by the simulation together with the lagged 
sensor measurements of the forces as sensor features [206]. The Random Forest regres- 
sor (RF) [101] is chosen for the prediction of Fa and Fp. The results are presented for 
10 cross-validation folds, where 9 scenarios (i.e. time series) are kept for training and 
1 scenario for testing for each fold [206]. The prediction error is evaluated using the 
NRMSE (Equation 3.17). We used 5 lagged values for the time series of the forces as 
sensor features. For each time step, lagged sensor values are joined together with the 
simulation features for the current time step (i.e. simulation features are pre-calculated 
and stored, and only sensor data is streaming). We have also devised a binary feature 
based on the simulation features called the activity feature, which indicates the engage- 
ment situation of the tool (0: no engagement, 1: engagement) and is added to the fused 
set of features. 
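
The construction of the fused feature set can be sketched as follows. This is a simplified illustration with hypothetical names: the target is the force at the current time step rather than the full 10 ms horizon, and the activity feature is derived from the first simulation feature (assumed here to be the chip volume) being greater than zero. The second function mirrors the evaluation scheme in which each scenario serves once as the test set.

```python
def build_features(forces, sim_features, n_lags=5):
    """Join lagged sensor forces with pre-computed simulation features.

    forces: list of measured force values, one per time step.
    sim_features: list of feature lists, one per time step (pre-calculated).
    Returns (X, y): feature rows and the force value to be predicted.
    """
    X, y = [], []
    for t in range(n_lags, len(forces)):
        lags = forces[t - n_lags:t]  # the n_lags most recent measurements
        # Assumption: column 0 is the chip volume; > 0 means tool engagement
        activity = 1.0 if sim_features[t][0] > 0.0 else 0.0
        X.append(list(lags) + list(sim_features[t]) + [activity])
        y.append(forces[t])
    return X, y


def leave_one_scenario_out(n_scenarios):
    """Yield (training scenario indices, test scenario index) for each fold."""
    for test in range(n_scenarios):
        yield [s for s in range(n_scenarios) if s != test], test


# Toy data: 8 time steps, 2 simulation features per step
forces = [0.0, 0.1, 0.0, 0.2, 5.0, 6.0, 5.5, 6.2]
sim = [[0.0, 1.0]] * 4 + [[2.0, 1.0]] * 4
X, y = build_features(forces, sim)
folds = list(leave_one_scenario_out(10))
```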

The results in Table 3.5 show that the feature-based fusion model outperforms the
models trained separately on each data source. The feature-based fusion is automatically
selected as the best way to perform data fusion using simulation and sensor data
after empirically computing the threshold τ derived in Equation 3.34 for the covariance
and the average bias of the single models trained separately on each data source. Our
theoretical insights are validated by a comparison with the model-based fusion [206],
which is also included in Table 3.5. Furthermore, examples of empirical evaluations of
the covariance, the threshold τ derived in Equation 3.34, the average bias of the single
models, and the empirical bias of the feature-based fusion model for the predictions of
Fa and Fp are shown in Table 3.6 to describe how the fusion-level selection is made. In
addition, the model variances of the model-based (i.e. ensemble) fusion and the
feature-based fusion are reported. For Fa, the covariance between the single models is
lower than the threshold, which guarantees that computing the ensemble model will
reduce the variance-type error; this is also confirmed by the empirical variance values
reported in Table 3.6. However, checking the covariance threshold is not sufficient. The
average bias of the single models should also be compared to the bias of the
feature-based fusion model. This comparison shows that the model-based fusion would
reduce the variance-type error but would also increase the bias by approximately a
factor of six. For Fp, the covariance between the single models is higher than the
threshold, which indicates that computing the ensemble model will not contribute to
reducing the variance-type error. In addition, the bias of the feature-based fusion is
lower than the reported average bias. This observation confirms that the model-based
fusion will reduce neither the variance nor the bias.

Tab. 3.5: Comparison between the NRMSE of predicted active and passive forces using different
methods.

Method | RF simulation | RF sensors | RF feature-based fusion | RF model-based fusion
NRMSE Fa | 15.29 % ± 3.70 % | 24.24 % ± 5.95 % | 9.95 % ± 2.90 % | 16.25 % ± 4.27 %
NRMSE Fp | 21.50 % ± 11.15 % | 33.63 % ± 5.38 % | 13.30 % ± 6.30 % | 19.69 % ± 5.37 %

Tab. 3.6: Comparison between different measures for the fusion-level selection.

Measure | Fa | Fp
Covar (RF simulation, RF sensors) | 21.18 | 4750.80
Threshold τ | 109.92 | -13092.95
Average bias (RF simulation, RF sensors) | 782.50 | 13714.16
Bias (RF feature-based fusion) | 121.21 | 3204.13
Var (RF feature-based fusion) | 330.40 | 8614.31
Var (RF model-based fusion) | 140.86 | 13367.52


3.4.5 Initialization of Simulation Models Using ML Methods 


First investigations are performed with respect to the initialization of simulation models 
using machine learning methods. To this end, two different simulation systems are 
considered, which are developed to investigate the milling and grinding processes, 
respectively. Parts of this section have already been published by the authors [207, 727].

3.4.5.1 Learning of Frequency Response Functions and Compliance Models

The underlying system of a geometric physically based milling simulation (cf. Sec- 
tion 3.4.2) models tool vibrations by a set of decoupled damped harmonic oscilla- 
tors [662]. The parameters of these oscillators have to be calibrated by measurements of 
the Frequency Response Function (FRF) of the considered machine-spindle-tool system, 
acquired by impact hammer tests [325]. In addition, the dynamic behavior of this system 
changes with different poses of the tool. The modeling is performed separately for all
considered spatial directions, which here are the x- and y-directions. Each
oscillator is parameterized by identifying values for the modal mass $m_m$, the natural
frequency $f_m$, and the damping constant $\gamma_m$. The complex response function [207]
is represented by superposing the amplitude and phase of the parameterized oscillators
$q$ for each angular frequency $\omega$. By using measured FRFs, the loss function
can be derived, which is the sum of the squared deviations between the normalized
values of the calculated amplitudes $A$ and phases $\varphi$ and the normalized measured
data $\hat{A}$ and $\hat{\varphi}$.
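
The superposition of oscillators and the loss function described above can be illustrated with a short sketch. It rests on assumptions that are not taken from the chapter: the oscillator response uses one common textbook convention (equation of motion x'' + 2γx' + ω₀²x = F/m), and the normalization divides by the largest measured amplitude and phase magnitude. The expressions used in [207] may differ.

```python
import cmath
import math


def frf(omega, oscillators):
    """Superposed response of decoupled damped harmonic oscillators.

    oscillators: iterable of (modal_mass, natural_frequency_hz, damping_constant).
    """
    total = 0j
    for mass, f_nat, gamma in oscillators:
        w_nat = 2.0 * math.pi * f_nat
        total += 1.0 / (mass * (w_nat ** 2 - omega ** 2 + 2j * gamma * omega))
    return total


def frf_loss(oscillators, omegas, amp_meas, phase_meas):
    """Sum of squared deviations of normalized amplitudes and phases."""
    amp_scale = max(amp_meas)
    phase_scale = max(abs(p) for p in phase_meas) or 1.0
    loss = 0.0
    for w, a_m, p_m in zip(omegas, amp_meas, phase_meas):
        g = frf(w, oscillators)
        loss += ((abs(g) - a_m) / amp_scale) ** 2
        loss += ((cmath.phase(g) - p_m) / phase_scale) ** 2
    return loss


# Synthetic "measurement" generated from one hypothetical oscillator
omegas = [2.0 * math.pi * f for f in (1000.0, 1400.0, 1500.0, 1600.0, 2000.0)]
reference = [(0.05, 1500.0, 200.0)]
amp_meas = [abs(frf(w, reference)) for w in omegas]
phase_meas = [cmath.phase(frf(w, reference)) for w in omegas]

loss_exact = frf_loss(reference, omegas, amp_meas, phase_meas)
loss_detuned = frf_loss([(0.05, 1450.0, 200.0)], omegas, amp_meas, phase_meas)
```

Minimizing such a loss over the oscillator parameters corresponds to the calibration step that the second learning task aims to replace.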

Two different learning tasks are pursued in the performed research. On the one 
hand, pose-dependent FRFs are learned in order to reduce the measurement effort. 
On the other hand, the resource- and time-consuming and only semi-automatic task 
of calibrating the oscillator parameter values for different tool poses is replaced by 
machine learning. 

The data needed to perform the investigations is obtained by frequency
response measurements in our laboratory for two different machine tools, Heller FT
4000 (M1) and DMG HSC 75 linear (M2), using a central composite experimental
design with star points. A total of 46 and 49 poses are measured for M1 and
M2, respectively. An impulse hammer (Kister 8206) is used to excite a ball end mill
(Fraisa X7400) with a diameter of 10 mm. The impulse response is measured by an
accelerometer (PCB Piezotronics 352C23) attached to the tool tip [207].

Fig. 3.16: Predicted and measured FRFs for machine tool M1 [207].


As mentioned, two different objectives are investigated. For both objectives, let X be a
set of J × N features sampled from an unknown distribution D and Y be a set of K × N
targets labeled by a labeling function. The first objective involves the prediction of FRFs.
For this, the measured FRFs of all P considered measurement poses are discretized into
data points by the frequency resolution Δf such that M is the number of frequencies
examined for each pose. Each of the N = P · M data points contains K targets
that include the compliance amplitude and phase shift for the x- and y-directions of the
machine coordinate system. Let J be the number of features consisting of the frequency
and the positions of the three axes defining the pose. For the second learning task,
which is the prediction of modal parameter values for given poses, let J consist of the
three pose-dependent features and N = P. For the targets, let K = 3 · (Qx + Qy), where Qx
and Qy are the number of oscillators in the x- and y-directions, respectively. Using this
approach, the learning task also attempts to represent the relationship between different
interdependent oscillators of each compliance model across the two vibration
directions. For both learning tasks, the goal is to find a learner h : X → Y with
respect to the distribution D [207].
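
The shapes of the two learning problems can be made concrete with a small sketch. The function names and the toy values are hypothetical. Samples are stored as rows here, so the feature matrix has N rows and J columns.

```python
def build_frf_dataset(poses, freqs, measured):
    """First task: one sample per (pose, frequency) pair.

    poses: list of P axis positions (x, y, z).
    freqs: list of M discretized frequencies.
    measured[p][m]: (amplitude_x, phase_x, amplitude_y, phase_y).
    Returns X with N = P * M rows and J = 4 columns, Y with K = 4 columns.
    """
    X, Y = [], []
    for p, pose in enumerate(poses):
        for m, f in enumerate(freqs):
            X.append([f, *pose])
            Y.append(list(measured[p][m]))
    return X, Y


def modal_target_dim(q_x, q_y):
    """Second task: three parameters (mass, frequency, damping) per oscillator."""
    return 3 * (q_x + q_y)


# Toy example with P = 3 poses and M = 4 frequencies
poses = [(0.0, 0.0, 0.0), (100.0, 0.0, 50.0), (0.0, 200.0, 50.0)]
freqs = [500.0, 1000.0, 1500.0, 2000.0]
measured = [[(1e-6, -0.1, 2e-6, -0.2) for _ in freqs] for _ in poses]
X, Y = build_frf_dataset(poses, freqs, measured)
```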

Figure 3.16 shows an exemplary comparison between measured and predicted
FRFs for two different poses using machine tool M1. The phases are predicted with
high accuracy in both the x- and y-directions for both test poses. The amplitudes in the
x-direction are predicted with a barely visible deviation from the measured curves
for both test poses. Two peaks are visible in the measured FRFs in the y-direction
in a frequency range of 1200 Hz to 1700 Hz, which are very distinct for P1. For P2,
the second peak can hardly be detected. This behavior could be represented by the
model to a certain extent. In addition, the model also predicts a visible second peak for
the remaining test poses, but it cannot reproduce the differences in how distinct the two
peaks are across the poses. This effect is observed to be minimal in the data
examined and needs to be analyzed in more detail in future research activities. In order
to represent such behavior, more observations in which the fusion of peaks is present
and that can be considered for the training procedure would be necessary.

Fig. 3.17: FRFs resulting from predicted and manually calibrated compliance models for machine tool
M2 [207].

Figure 3.17 shows a comparison between FRFs that resulted from predicted and
manually calibrated compliance models in the x- and y-directions for one specific pose
using machine tool M2. Generally, good agreement is observed. Examining the zoomed-in
areas of the FRFs, it can be seen that the shape of the measured FRF can only be
represented coarsely by the manual fitting procedure. Since the fitted oscillator parameter
values serve as target values for the learning task, the FRF calculated from the
predicted oscillator parameter values cannot reproduce the measured behavior in
higher detail than the FRF that results from the fitting procedure. Nevertheless, there are
only small deviations between the fitted and predicted data. Therefore, the learning of
oscillator parameter values directly from given poses can be interpreted as successful.

Fig. 3.18: Images of (a) measured depth information of grinding tool topographies and (b) the
corresponding segmentation mask [727].


3.4.5.2 Augmented Semantic Segmentation for the Digitization of Grinding Tools 
The following results are based on the investigations of [727]. The stochastic nature of
grinding processes entails several challenges regarding process simulations. For modeling
grinding tools, the choice of the methods for representing the individual grains
and grain shapes has a significant influence on the accuracy of simulation results.
Especially for single-grain scratch simulations and FE-based analyses, the identification
of grains that adequately represent the overall characteristics of the tool used is crucial.
In order to perform this identification successfully, a large number of grains, which
have to be manually separated from the bond, must be analyzed. We developed a
learning-based methodology to automate this separation for digitized grinding tools
by semantic segmentation. We have focused in particular on evaluating the prediction
accuracy of the grain boundaries to be able to distinguish neighboring grains. This is
crucial for a subsequent automated extraction of the grains. Figure 3.18a visualizes the
measured depth information of grains in the bond. In addition, a manually generated
segmentation mask is shown in Figure 3.18b. For the semantic segmentation, a novel
neural network architecture [727] is developed, which is based on Fully Convolutional
Networks (FCN) [420] (see Figure 3.19). In contrast to conventional FCN architectures
found in the literature, the channel information is gradually down-sampled by half in each
transposed convolution layer instead of performing a single reduction operation. In
addition, the up-sampling of the image dimensions is spread across three twofold
up-sampling steps.
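
The decoder layout described above can be traced with simple shape arithmetic. This is only a sketch of the stated principle (halve the channels and double the spatial size in each of three steps). The starting channel count and image size are hypothetical, and the actual layer configuration is given in [727].

```python
def decoder_plan(channels, height, width, steps=3):
    """Return the (channels, height, width) shape after each up-sampling step."""
    plan = [(channels, height, width)]
    for _ in range(steps):
        channels = max(1, channels // 2)         # channel information halved per step
        height, width = 2 * height, 2 * width    # twofold spatial up-sampling
        plan.append((channels, height, width))
    return plan


# Hypothetical encoder output: 256 channels at 32 x 32 pixels
shapes = decoder_plan(256, 32, 32)
```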

Out of 4678 grains, 500 are used for testing purposes [727]. For hyper-parameter
identification, random search [57] is used. In order to evaluate the prediction accuracy,
the pixel accuracy [420] $\mathrm{PACC} = \sum_i N_{ii} / \sum_i M_i$ is used, where $N_{ij}$ is the number of
pixels of class $i$ that are predicted to belong to class $j$ and $M_i = \sum_j N_{ij}$ is the number
of pixels of class $i$. In addition, the boundary pixel accuracy (BPACC) is calculated as
the pixel accuracy of the boundary pixels of each grain, which are estimated using a
border following algorithm [665].
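
Both measures can be sketched for small label masks. Note the substitution: instead of the border following algorithm [665], the sketch marks a grain pixel as a boundary pixel if one of its four neighbors has a different label or lies outside the image. This is a simplified stand-in chosen for brevity, and the function names are hypothetical.

```python
def pixel_accuracy(truth, pred, pixels=None):
    """Share of correctly classified pixels, i.e. sum_i N_ii / sum_i M_i.

    truth, pred: 2D lists of class labels; pixels: optional (row, col) subset.
    """
    if pixels is None:
        pixels = [(r, c) for r in range(len(truth)) for c in range(len(truth[0]))]
    pixels = list(pixels)
    correct = sum(1 for r, c in pixels if truth[r][c] == pred[r][c])
    return correct / len(pixels)


def boundary_pixels(mask, background=0):
    """Grain pixels with a differently labeled 4-neighbor or at the image edge."""
    rows, cols = len(mask), len(mask[0])
    result = []
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] == background:
                continue
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                outside = not (0 <= rr < rows and 0 <= cc < cols)
                if outside or mask[rr][cc] != mask[r][c]:
                    result.append((r, c))
                    break
    return result


truth = [[0, 0, 0, 0], [0, 1, 1, 0], [0, 1, 1, 0], [0, 0, 0, 0]]
pred = [[0, 0, 0, 0], [0, 1, 1, 0], [0, 1, 0, 0], [0, 0, 0, 0]]
pacc = pixel_accuracy(truth, pred)
bpacc = pixel_accuracy(truth, pred, boundary_pixels(truth))
```

In this toy case, a single wrong pixel barely affects the PACC but lowers the BPACC noticeably, which is why the two measures can disagree.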

Figure 3.20 shows the results for grain segmentation for different numbers of grains
used for training, comparing the developed approach with an FCN-8. The
conventional FCN-8 delivers insufficient results for all three investigated numbers of
grains used for training. By contrast, using the developed FCN architecture, it is even
possible to distinguish closely neighboring grains if 2575 or 4178 grains are used for
training [727].

Fig. 3.19: FCN architecture including (a) the convolutional network, (b) a deconvolutional network
based on FCN-8, and (c) the developed approach for the deconvolutional network [727].

To successfully train ML models, manual segmentation still has to be conducted
in order to establish the required feature/target correspondences. To limit this effort,
data augmentation is used to drastically reduce the necessary number of measurements.
Different image manipulation techniques [605], e.g., rotation, flipping, or noise
injection, are combined in a random sequence to increase the amount of training data. In
order to quantify the degree of augmentation, the augmentation factor per image (AFPI)
is defined as the number of images generated for each measured image in
a combined training set. Figure 3.21 shows the segmentation results in terms of the PACC
and the BPACC for different numbers of grains used for training and different values of
the AFPI. The PACC increases with the number of grains used and with the AFPI.
However, for the BPACC, a local optimum can be identified, indicating that
high PACC values do not necessarily result in good segmentation of the grain
boundaries. Furthermore, choosing the AFPI as low as possible results in a significantly
lower training runtime [727]. Since grinding tools are constantly affected by tool wear
during the process, the transferability of the model corresponding to the local optimum
to different states of tool wear is also investigated. For further details, see the
corresponding publication [727].

Fig. 3.20: Segmentation results using the developed approach in comparison with using an
FCN-8 [727].

Fig. 3.21: Segmentation results with varying number of grains used for training and values of the
AFPI [727].
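
The augmentation procedure described above can be sketched with plain nested lists. The set of operations, the noise level, and the way the random sequence is drawn are assumptions made for illustration. Geometric operations are applied to both the depth image and its mask, whereas noise is applied to the depth image only, so that the labels stay valid.

```python
import random


def rotate90(img):
    """Rotate a 2D list clockwise by 90 degrees."""
    return [list(row) for row in zip(*img[::-1])]


def flip_horizontal(img):
    return [row[::-1] for row in img]


def add_noise(img, rng, sigma=0.01):
    return [[v + rng.gauss(0.0, sigma) for v in row] for row in img]


def augment(pairs, afpi, seed=0):
    """Generate `afpi` augmented (depth image, mask) pairs per measured pair."""
    rng = random.Random(seed)
    out = []
    for depth, mask in pairs:
        for _ in range(afpi):
            d, m = depth, mask
            ops = ["rotate", "flip", "noise"]
            rng.shuffle(ops)
            for op in ops[:rng.randint(1, 3)]:  # random sequence of 1 to 3 operations
                if op == "rotate":
                    d, m = rotate90(d), rotate90(m)
                elif op == "flip":
                    d, m = flip_horizontal(d), flip_horizontal(m)
                else:
                    d = add_noise(d, rng)  # mask is left untouched
            out.append((d, m))
    return out


depth = [[0.1, 0.2], [0.3, 0.4]]
mask = [[0, 1], [1, 1]]
augmented = augment([(depth, mask)], afpi=4, seed=1)
```

The measured pairs would be combined with the augmented ones to form the training set; an AFPI of 4 then adds four generated images per measurement.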


3.4.6 Summary and Conclusions 


This contribution has presented investigations that highlight different aspects of the 
combination of sensor and simulation data for scientific insights into machining. The 
methods are illustrated by milling, grinding, and tunneling processes as case studies. 
The main resource restriction in machining applications is processing time. Ensemble 
learning methods enable the real-time capability of simulated predictions. ML enhances 
the simulation. Another resource, which is often restricted in real-world applications, 
is (labeled) data. Various data fusion strategies were discussed to combine sensor and 
simulation data for real-time predictions of process characteristics of milling operations. 
Here, simulations and ML help each other. 

In addition, ML methods were used to initialize simulation models, specifically the 
dynamic model of a geometric physics-based milling simulation system and the tool 
model of a grinding simulation. ML helps the simulation.
The presented results emphasize the significant potential of combining sensor data,
simulation results, and ML methods for the analysis and optimization of manufacturing
processes in the context of Industry 4.0.

3.5 High-Precision Wireless Localization


Janis Tiemann 


Abstract: Recent developments in Ultra-Wideband (UWB) wireless communication 
enable wireless localization as a link between the digital and the physical world. With 
the technological advances, the achievable precision and accuracy have increased dramatically,
such that novel applications exploiting this precise cyber-physical link become
feasible. Autonomous swarms of robots, precise and scalable tracking of goods or 
safety applications are within the reach of this potential. However, the increase in 
communication required for such capabilities to become feasible is constrained by 
bounds of channel utilization, energy consumption, and intelligent information dis- 
tribution. Therefore, novel approaches for maximizing information and localization 
throughput while minimizing channel utilization and power consumption and main- 
taining precise localization results are crucial to overcoming technology barriers and 
ultimately enabling a connected cyber-physical world. Promising approaches are novel 
localization-specific protocols to coordinate channel access among localization targets 
in order to achieve reliable data rates while minimizing actual power consumption. 
Further, intelligent approaches are required to increase the achievable accuracy for 
these resource-efficient localization approaches such as Time-Difference of Arrival 
(TDOA). In those cases, additional parameters of the radio channel can be exploited 
to obtain quality indicators for measurement and mitigate outliers or generally im- 
prove the localization accuracy through adequate estimation. In the following, the 
requirements of several applications are analyzed, and an overview of the solution 
space in terms of channel utilization, energy efficiency, and accuracy is given. Based 
on these requirements, solution approaches are presented to improve both channel 
utilization and energy efficiency. Further, approaches to increase the achievable ac- 
curacy in challenging environments are illustrated and evaluated. It is shown that 
novel approaches for high-precision wireless localization enable novel applications by 
employing localization-specific protocols and methods to improve the accuracy despite 
challenging conditions. 


3.5.1 Introduction: Precise, yet Scalable Wireless Localization 


Wireless localization is seen as an enabler for many applications requiring a link be- 
tween the physical and the digital world. Many approaches exist to achieve this connec- 
tion utilizing a wide range of technologies. An overview of the capabilities in terms of 
accuracy and range of those technologies is given in Figure 3.22. The diagram illustrates 
the difference between widely adopted communication technologies, such as cellular
networks and consumer-grade wireless standards, and more application-targeted
standards, such as UWB. Early methods for localization were solely based on cell-id
and/or sector differentiation. However, novel approaches are capable of increasing
the accuracy for tracking and guiding applications. More dedicated localization systems
like Global Navigation Satellite Systems (GNSS) can achieve accurate localization
results but are limited to outdoor Line of Sight (LOS) operations, as they suffer severely
from multi-path fading. Here, UWB technology is key to overcoming the technology
barrier, enabling highly precise indoor localization through precise Time Of Arrival
(TOA) estimation. For localization methods enabled by 5G mmWave communications,
see Section 5.5.

Fig. 3.22: Illustration of the field of localization capabilities of wireless communication technologies
in terms of accuracy, range, and place of usage.


Hence, a significant amount of research has been stimulated by this newly available
technology. Integrated UWB solutions enable a new degree of accuracy in wireless
localization. In contrast to many signal strength-based methods, TOA-based approaches
have emerged as the most promising candidates for accurate and reliable measurements
in the centimeter range. Due to the usage of high bandwidths that enable sharp pulse-based
modulation, these UWB systems are capable of resolving many multipath-induced
errors in TOA estimation [228].

For this reason, research arose in several application areas utilizing this
newly available connection between the digital and the analog world. One particular
area of interest is the field of massively scalable and low-power localization for logistics
in Industry 4.0. Here, low-power localization devices are key to monitoring and
optimizing the whole production process, as illustrated in Figure 3.23.

For large-scale deployment, however, the scope of most current research is
insufficient, as it neglects scalability and multi-user interference by utilizing localization
approaches that require the exchange of many messages. This not only causes
significant resource usage in terms of channel utilization but also requires substantial
energy for message exchanges. In summary, this section addresses the following
points:

- Massive Multi-User Scalability
- Minimal Energy Consumption at the Mobile Units
- High Accuracy Suitable for Control-Grade Applications


3.5.2 Related Work: Evolution of wireless localization within CRC 


The methodological continuity of the work within the CRC is illustrated by several
publications:

- UWB Indoor Positioning for UAVs (Unmanned Aerial Vehicles) [682] uses Two-Way
  Ranging to enable UAV indoor navigation. Limitations in multi-user scalability
  motivated further research.

- Multi-User Interference Analysis [685] is a detailed look into multi-user interference
  for UWB systems and wireless clock synchronization. Further analysis is
  conducted in [687], which provides an analytical model for the interference.

- ATLAS - TDOA-Based Localization [684] overcomes multi-user scalability limitations
  and presents an open-source approach.

- Scalable Multi-UAV Indoor Navigation [688] demonstrates the scalability and
  accuracy for control-grade systems.

- Enhanced UAV Indoor Navigation [681] improves the precision and range for
  control-grade systems.

- Further work provides an open-source extension, ATLAS FaST, to the ATLAS approach
  in order to improve energy efficiency and reliability [678, 679, 687].

- Finally, the PhD thesis Scalability, Real-Time Capabilities and Energy Efficiency
  in Ultra-Wideband Localization incorporates and extends the key findings of the
  research, see [683].

Fig. 3.23: Illustration of a potential usage scenario.


3.5.3 Approaches: Scalable, Real-Time Capable Energy Efficient Localization through 
UWB 


The following sections present approaches for high-precision wireless localization
based on the work in [680, 683, 687]. The approaches are tailored to find a sweet spot in
the trade-off between multi-user scalability, energy efficiency, and achievable accuracy
for wireless localization. As a first step, the underlying ATLAS localization system is
briefly outlined.

Fig. 3.24: System architecture of the ATLAS RTLS utilizing the FaST scheduling approach. The
approach is based on the Robot Operating System (ROS) to allow for flexible and modular design
as well as seamless integration with mobile robots.


3.5.3.1 ATLAS: Open-Source TDOA-Based Localization 

In the context of the ATLAS Real-Time Localization System (RTLS), we propose building 
upon the TDOA topology in which the mobile nodes transmit a single message and the 
infrastructure receives (R-TDOA). Wireless clock synchronization is used to achieve a 
common time base among the clocks of the static infrastructure-based anchor nodes. 
Yet, random access is incapable of providing guaranteed update rates required by many 
robotic applications. Furthermore, with an increase of mobile nodes, the quality of 
the localization will degrade due to missed synchronization frames, as pointed out in 
[685]. This means that the real-time requirements for control-grade applications such 
as indoor UAV navigation cannot be met at scale by random access. 
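
The principle behind the R-TDOA topology can be illustrated with a small sketch: synchronized anchors timestamp the reception of a single tag transmission, and the differences between these timestamps translate into range differences that constrain the tag position. The coarse grid search below is a deliberately simple stand-in chosen for readability. It is not the estimator used in ATLAS, and the anchor layout and room size are hypothetical.

```python
import math

C = 299_792_458.0  # speed of light in m/s


def range_differences(rx_times, ref=0):
    """Range differences relative to a reference anchor from reception times."""
    return [C * (t - rx_times[ref]) for t in rx_times]


def residual(pos, anchors, rx_times, ref=0):
    """Squared mismatch between predicted and measured range differences."""
    dist = [math.dist(pos, a) for a in anchors]
    meas = range_differences(rx_times, ref)
    return sum(((dist[i] - dist[ref]) - meas[i]) ** 2 for i in range(len(anchors)))


def locate_grid(anchors, rx_times, size=10.0, step=0.5):
    """Pick the grid point in a size x size room that best explains the timestamps."""
    n = int(size / step) + 1
    candidates = [(i * step, j * step) for i in range(n) for j in range(n)]
    return min(candidates, key=lambda p: residual(p, anchors, rx_times))


# Four anchors in the corners of a hypothetical 10 m x 10 m room
anchors = [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
true_pos = (3.0, 4.0)
t_tx = 0.001  # unknown to the infrastructure; cancels out in the differences
rx_times = [t_tx + math.dist(true_pos, a) / C for a in anchors]
estimate = locate_grid(anchors, rx_times)
```

Since only differences of reception times enter the computation, the tag does not need to know when it transmitted, which is what allows it to send a single frame and return to sleep.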

3.5.3.2 ATLAS FaST: Lightweight Scheduling for Control-Grade Applications

Based on previous work [684], we thus propose a novel lightweight scheduling protocol 
that seamlessly integrates with the wireless clock synchronization needed for proper 
operation, see [687]. As depicted in Figure 3.23, our approach aims to support scalable, 
low-power, reliable, and real-time capable wireless localization. In the following, we 
will document the approach and experimentally evaluate its capabilities in an industrial 
setting. The open-source implementation provided will enable the usage of scalable 
UWB localization in the robotics and automation community, see [678, 679].

Fig. 3.25: Temporal structure of the ATLAS FaST localization-specific Time Division Multiple Access
(TDMA) scheme. The structure utilizing subframes enables multiple spatially distributed
synchronization cells.


The publication [687] illustrates the continuity of the development of wireless local- 
ization building upon the widely used Robot Operating System (ROS). In contrast to 
previous work, which focuses mainly on plain localization aspects, a significant in- 
crease in scalability and real-time capabilities is provided, presenting a method that 
allows for scalable localization without degradation in the performance of the localiza- 
tion results. 

The proposed system architecture is depicted in Figure 3.24. The ATLAS Concentrators
can connect to multiple anchors and synchronize them through a
direct connection.

The chosen TDOA topology used in our system design enables our mobile nodes to 
be very energy efficient, as pointed out in Section 3.5.3.3. Only a single frame needs to 
be transmitted in order to obtain a full localization result. However, if a guaranteed 
update rate is desired, bi-directional communication, synchronization, and association 
with the system’s scheduler are required. Through the modular system architecture, the 
individual components can be developed independently, which significantly reduces 
the time required to integrate application-specific features or competition-specific con- 
straints. As mentioned before, the modularity of this concept is illustrated in Figure 3.24. 

To lower the overall system complexity, the synchronization request is the same
as any other positioning frame transmitted over the UWB channel. It consists of the
tag's Extended Unique Identifier (EUI) and a sequence number that is increased with
every new message. Depending on the application, this message can be extended with
Inertial Measurement Unit (IMU) data or the battery status.
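
The frame content described above can be sketched as follows. This is a minimal illustration only: the byte layout (a little-endian 64-bit EUI followed by an 8-bit sequence number), the function names, and the opaque payload are assumptions made for this sketch and are not taken from the ATLAS implementation.

```python
import struct

# Hypothetical layout, assumed for illustration: little-endian 64-bit EUI,
# then an 8-bit sequence number, then an optional application payload.
_HEADER = struct.Struct("<QB")


def build_positioning_frame(eui: int, seq: int, payload: bytes = b"") -> bytes:
    """Pack the tag's EUI and sequence number, followed by optional payload bytes."""
    return _HEADER.pack(eui, seq % 256) + payload


def parse_positioning_frame(frame: bytes):
    """Return (eui, seq, payload) from a frame produced by build_positioning_frame."""
    eui, seq = _HEADER.unpack_from(frame)
    return eui, seq, frame[_HEADER.size:]


def next_sequence(seq: int) -> int:
    """Sequence number of the next message; wraps around after 255 (assumed width)."""
    return (seq + 1) % 256
```

Because the synchronization request and the regular positioning frame share one layout in this sketch, the receiving side needs only a single parser; optional data such as a battery status byte simply travels in the payload.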

After accessing the channel, the tag immediately goes back to sleep and waits
for the next sync frame. In the meantime, the anchor nodes receive and process the
requests. Successful random access requests are propagated to the scheduling engine,
which has a database of pre-known period configurations and priorities from which it
can prepare a response based on the requesting tag's EUI. This procedure reduces the
configuration overhead, as the system is infrastructure-based. It also allows for a
seamless, graceful degradation of the update rate if the overall system capacity limit
is reached.


3.5.3.3 ATLAS Low-Power Timekeeping under Resource Constraints 

For random access-based schemes, proper absolute time-keeping is not required. For the
proposed scheduled access, though, knowing the correct absolute time is of the essence.
However, keeping the high-frequency clock of the transceiver modules running draws
significant power and therefore shortens battery life. For this reason, a concept
of time-keeping, synchronization, and constant calibration should be employed for
proper low-power operation in the scheduled scheme. This basic concept can be utilized
for real-world implementations in combination with ATLAS FaST to support future
low-power applications.

The transceiver should run in a mode that does not enable the high-frequency 
clock sources and, thus, loses its absolute time knowledge. Therefore, the proposed 
scheme employs intelligent switching between different clock sources to maintain the 
transmission schedule in the mobile nodes. 
Fig. 3.26: An approach for low-power association with the system controller.


The initial association procedure at the mobile node is depicted in Figure 3.26. Tra is
the duration between the sync frame and random access, Tp(k) the duration between
the response frame and the k-th positioning frame, Tm the duration of the master frame,
and k is the localization sample iterator. Three main components draw power from the
battery: the transceiver, the host controller, and a Real-Time Clock (RTC) within the
host controller. Here, the internal RTC is used as the keeper of the overall system
clock. The transceiver oscillator is calibrated using two consecutive UWB
synchronization frames. The resolution of the RTC is insufficient for calibrating it
upon initial association. Instead, the re-association should be utilized, as it
provides a sufficiently long time span to enable RTC drift calibration.
Fig. 3.27: Approach for low-power re-association with the system controller.


The typical re-association procedure, including the following transmissions, is depicted
in Figure 3.27, where Np is the positioning period exponent for a mobile node, Nsps the
number of slots per subframe, and Ts the slot duration. The transceiver and the host
controller can be configured to a sleep state. An RTC timer can then be configured to
wake up the host controller for the next positioning frame. Here, the delays for waking
up the host controller and the transceiver need to be calibrated in the implementation
phase. The processing times and oscillator start-up times need to be accounted for.
Based on this, the RTC wake-up time can be configured. Although the RTC, which mostly
runs at 32.768 kHz, offers only limited precision, the available resolution of around
30.52 µs is sufficient to stay within the margins of the scheduling scheme. However, the
potential error through clock-traversal needs to be accounted for in the slot-duration
dimensioning for the scheduled scheme.
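
The resolution argument can be checked with a few lines of arithmetic. The sketch below uses the 32.768 kHz RTC frequency from this section and the rate of 2048 slots per second stated in Section 3.5.4.1; the helper `wakeup_ticks` and its wake-up latency parameter are hypothetical and only illustrate how a calibrated start-up delay could be subtracted before programming the timer.

```python
from fractions import Fraction

RTC_HZ = 32_768           # typical RTC crystal frequency named in the text
SLOTS_PER_SECOND = 2048   # slot rate stated in Section 3.5.4.1

rtc_tick = Fraction(1, RTC_HZ)        # seconds per RTC tick
slot = Fraction(1, SLOTS_PER_SECOND)  # seconds per TDMA slot
ticks_per_slot = slot / rtc_tick      # RTC ticks that fit into one slot


def wakeup_ticks(slots_ahead: int, wake_latency_s: Fraction) -> int:
    """RTC ticks to sleep so that the node is awake wake_latency_s before its slot.

    The result is rounded down so that the node never wakes up late.
    """
    delay = slots_ahead * slot - wake_latency_s
    if delay < 0:
        raise ValueError("wake-up latency exceeds the time until the slot")
    return int(delay / rtc_tick)  # int() floors non-negative fractions


print(float(rtc_tick) * 1e6)  # 30.517578125 (microseconds per tick)
print(float(slot) * 1e6)      # 488.28125 (microseconds per slot)
print(ticks_per_slot)         # 16
```

With these numbers one slot spans exactly 16 RTC ticks, so a wake-up error of one tick corresponds to 1/16 of a slot, which is the kind of margin the slot-duration dimensioning has to absorb.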


3.5.3.4 Wireless Signal Assessment to Improve Overall Accuracy 

Due to the utilization of a TDOA-based localization scheme, another variable in the 
localization solution is introduced. Instead of rangings as with TWR-based localization, 
TDOA utilizes the time difference of the arriving signal. Hence, the localization is 
inherently more challenging. Therefore, countermeasures such as introduced in [680] 
are required to obtain precise localization results. 

One of the main benefits of UWB is the availability of additional information of the 
received signal. Here, methods can be applied that leverage this information in order 
to weigh the individual measurements. Figure 3.28 illustrates this concept. Here, the 
ratio between the first path and the remaining energy of the channel impulse response 
is taken in order to weigh the measurements in the Extended Kalman Filter (EKF) used 
to obtain the localization results. 
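
A minimal sketch of such a weighting is given below. The window length, the inverse-ratio mapping to a measurement variance, and all names are assumptions for illustration; the concrete weighting used in the ATLAS filter is not specified in this section.

```python
def first_path_ratio(cir, first_path_index, window=3):
    """Energy in the first-path window divided by the energy of all remaining taps.

    cir is a sequence of (real or complex) channel impulse response samples;
    first_path_index marks the detected first path. Both are assumed inputs.
    """
    energy = [abs(tap) ** 2 for tap in cir]
    stop = min(len(cir), first_path_index + window)
    first = sum(energy[first_path_index:stop])
    rest = sum(energy) - first
    return float("inf") if rest == 0 else first / rest


def measurement_variance(base_variance, ratio, floor=1e-3):
    """Hypothetical weighting: inflate the variance when the first path is weak."""
    return base_variance / max(ratio, floor)


# Line-of-sight-like response: dominant first path at index 2.
clean = [0.0, 0.0, 10.0, 2.0, 1.0, 0.5, 0.2]
# Obstructed response: weak first path followed by strong reflections.
multipath = [0.0, 0.0, 3.0, 2.0, 1.0, 6.0, 5.0, 4.0]

r_clean = first_path_ratio(clean, 2)
r_multi = first_path_ratio(multipath, 2)
```

In a Kalman filter, a larger measurement variance reduces the influence of that measurement on the state update, so receptions with a weak first path contribute less to the position estimate.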
Fig. 3.28: An approach for signal quality analysis to improve accuracy. Note that the depicted
channel response is interpolated from ten experimentally recorded sets in which the resolution is
1 ns per sample.


There are concepts that may be capable of extracting even more information such that 
even passive localization or simultaneous localization and mapping in certain scenarios 
are feasible. An initial evaluation of machine learning on these channel responses can 
be found in [686]. 


3.5.4 Results 


In order to evaluate the performance characteristics of the presented approaches, se- 
lected evaluations based on the work in [683, 687] are discussed. 


3.5.4.1 Scalability of the ATLAS FaST Approach 

One of the main benefits of random access is the simplicity of implementation due to the 
lack of coordination. Tags can be implemented in a transmitter-only design, allowing 
for a long battery lifetime through intermediate sleep modes. The main down-side, 
however, is the lack of predictability. For many applications, especially in real-time 
control of autonomous systems, defined update rates are required. Here, a higher overall 
throughput is not directly useful, if long inter-arrival times are expected. Therefore, 
systematic channel access is desirable. 

With increased loads, low energy, battery-powered applications suffer from the 
non-reception of localization frames. Therefore, the effective energy per position ratio 
increases, leading to decreased efficiency. The ATLAS FaST scheme is designed to 
overcome these issues. In the following, the achievable inter-arrival times and reliability 
are analyzed. 7 ,¢ considers a processing time of Tproc and the partial preamble reception 
effect. 
The resulting throughput is the superposition of these individual effects, which depend
highly on the implementation of the receiving side in practical systems. Due to the
unique characteristics of the UWB PHY, a non-destructive R-TDOA scheme is feasible,
as the throughput does not degenerate with increased loads in the available ranges.
However, due to the non-reception of frames, the real-time capabilities of the
localization system degrade, as a defined update rate cannot be guaranteed. This is
especially severe for applications with tight real-time constraints.


Fig. 3.29: Experimental vs. analytical throughput analysis for R-TDOA-based localization in which the
infrastructure estimates the location. Plot title: R-TDOA, Random Access vs. ATLAS FaST Scalability
over Variation of Nrasps. Plot parameters: Npsdu = 96 bit, Nsfd = 8, R = 6.8 Mb/s, fpr = 62.4 MHz,
Ts = 488 µs, Tproc = 254 µs.


ATLAS FaST overcomes this issue. Due to the slotted approach and the contention-free
channel access in the scheduled slots, the successful positioning throughput scales
linearly with the positioning frame load, as depicted in Figure 3.29. Here, Npsdu is the
number of transmitted payload bits per packet, Nsfd the UWB start of frame delimiter
size in symbols, R the effective data rate, and fpr the mean pulse repetition frequency
on the physical layer. The slot duration Ts is chosen such that there are 2048 available
slots per second. Non-reception due to noise-induced frame errors was neglected
in this analysis, as it depends highly on the link budget that the wireless localization
system planner provides for the given setup. However, the variation of the number
of random access slots will define the upper bound of the system's capacity.

Alongside the throughput of the proposed FaST approach, Figure 3.29 depicts an
analytical model for R-TDOA together with results obtained in a scaled experiment.
This allows the multiuser scalability of the proposed approach to be compared with
that of the previous, random access-based R-TDOA.
It is clear that the scheduled approach allows greater throughput than random
access-based R-TDOA under the same underlying positioning load. Also,
the analytical model for the R-TDOA throughput matches the experiment closely. The
current implementation supports only static adjustment of the number of random
access slots Nrasps; for optimal performance, a dynamic adjustment would be necessary.
However, even when using a static set of six random access slots, FaST supports more
than 1000 mobile units at 1 Hz, assuming a long re-association interval.


3.5.4.2 Energy Utilization of Localization Schemes 

In order to provide a simplified comparison of the energy consumption at the mobile
unit for different channel access schemes, the bar chart in Figure 3.30 summarizes
the results of this work relative to the most widely used topology, Symmetrical
Double-Sided Two-Way Ranging (SDS-TWR), which is considered state of the art.

Here, a minimal set of four anchors is the baseline for all topologies. Multiple 
different ranging or localization schemes are considered: R-TDOA with random access 
as introduced in the early ATLAS implementations [684]; T-TDOA as illustrated by 
[256, 378] with transmitting anchors and a receiving-only mobile unit, similar to GNSS; 
typical single-sided Two-Way Ranging (TWR), which requires a message exchange of 
two messages per ranging; TWR with Multiple Acknowledgments (TWR-MA), utilizing 
a repeated response to estimate clock offset; combined TWR, which orchestrates a TWR 
exchange by pre-defined individual response time offsets for a set of anchors; and, 
finally, symmetric double-sided TWR (SDS-TWR), which utilizes symmetric response 
times to cancel out clock drift during ranging. The last one requires three messages for 
basic SDS-TWR, but it can also be configured to report calculated ranges in SDS-TWR-R, 
requiring another message at the end of each ranging. For the SDS-TWR topology, no
reporting from the anchor side is considered, so the range information is available
only at the infrastructure side, as is the case with R-TDOA or FaST. SDS-TWR-R, in
contrast, includes reporting and therefore consumes even more energy than plain SDS-TWR.

Keep in mind that the energy consumption of the TWR- and T-TDOA-based topologies
increases linearly with the number of anchors in the setup, while the energy
consumption of R-TDOA-based localization is independent of the anchor count. Therefore,
it can be stated that even in the worst-case scenario the proposed topology outperforms
traditional TWR-based schemes in terms of scalability and energy usage.
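
The scaling behavior can be illustrated with a simple message-count model. Everything in the sketch below is a simplifying assumption: the mobile unit is taken to initiate each ranging, the per-frame energies are the two values quoted in this section (with the smaller one assumed to belong to transmission and in the units given in the text), and overhead such as FaST synchronization or re-association is ignored.

```python
# Energy per frame at the mobile unit. The assignment of the two quoted values
# to transmission and reception is an assumption of this sketch.
E_TX = 40.7
E_RX = 76.9

# (frames sent, frames received, scales with anchor count) at the mobile unit.
# Message counts assume that the mobile unit initiates every ranging exchange.
SCHEMES = {
    "R-TDOA": (1, 0, False),   # a single transmitted frame per position
    "T-TDOA": (0, 1, True),    # receive-only mobile unit, one frame per anchor
    "TWR": (1, 1, True),       # two messages per ranging
    "SDS-TWR": (2, 1, True),   # three messages per ranging
}


def energy_per_position(scheme: str, anchors: int) -> float:
    """Idealized energy spent by the mobile unit for one position result."""
    tx, rx, per_anchor = SCHEMES[scheme]
    factor = anchors if per_anchor else 1
    return factor * (tx * E_TX + rx * E_RX)


def relative_to_sds_twr(scheme: str, anchors: int = 4) -> float:
    """Energy relative to SDS-TWR for the same anchor count."""
    return energy_per_position(scheme, anchors) / energy_per_position("SDS-TWR", anchors)
```

Under these assumptions, doubling the anchor count doubles the energy of every ranging-based scheme, while the R-TDOA value stays constant, which is the qualitative behavior described above.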

By calculating the power per transmitted and received UWB frame for the
transceiver system and the fundamental requirements for the localization topologies,
we can determine an idealized maximal battery life. An exemplary positioning
rate of 1 Hz, a re-association period of 300 s, and a reliability look-ahead of 1 were
chosen.

The power consumption is calculated based on the datasheet of the DW1000
transceiver, resulting in Etx = 40.7 mJ and Erx = 76.9 mJ. Note that the different PSDU
(Physical layer Service Data Unit) sizes that are required for the different schemes
are simplified to the lowest size, that of the R-TDOA scheme, due to the implementation
dependency of this value. Without this simplification, the discrepancy between TDOA-
and TWR-based schemes would be even greater.


Fig. 3.30: Exemplary analytical evaluation of energy utilization per localization approach. Note
that the main factor influencing the energy consumption is the number of messages required per
localization result. Plot title: Energy Consumption at Mobile Unit per Position Compared to
SDS-TWR for a Set of Na = 4 Anchors for 3D Localization.


The resulting battery lifetime is depicted in Figure 3.30. The battery lifetime of the 
T-TDOA- and TWR-based approaches is lower by orders of magnitude due to the de- 
pendency on the number of anchor nodes Na inherent in those topologies. For the 
R-TDOA-based approaches, the FaST results are close to those of random access, which 
is the baseline for low power consumption at the mobile node because there is no 
additional overhead. Therefore, it can be stated that through the planned scheduling 
with infrequent synchronization, FaST is capable of tracking many low-power devices 
without interfering with the real-time requirements for critical applications. 


3.5.4.3 Accuracy 

For the evaluation of the accuracy, several tracks were followed. One of the main
contributions was a set of lab experiments evaluating the accuracy of the signal quality
assessment under motion capture tracking, as published in [680]. To achieve
internationally comparable results, participation in competitions is key, as it is the
only way to benchmark localization results comparably. In the following, based on [683],
the participation in two international competitions is highlighted briefly.

The EvAAL competition alongside the Seventh International Conference on Indoor
Positioning and Indoor Navigation (IPIN 2016) in Alcalá de Henares, Madrid, Spain, was
used to provide a comparable basis for localization systems in the context of robotic
tracking. In the fourth track, Indoor Mobile Robot Positioning, six teams registered.
While four teams registered in-track, two teams were out-of-track from within the
organization team and were only evaluated for comparison. In [507, 549] the organization
and results of this competition are covered in detail.
Fig. 3.31: Reconstructed competition results for the EvAAL 2016 as a bar chart. The official
competition metrics were based on the third quartile Q(75 %).


The goal of the competition was to localize a mobile robot following a predetermined
track that has to be obtained by the evaluated system. A set of four poles to mount
the system under evaluation was provided by the organizers to cover the 12 m × 6 m
evaluation area. The systems were evaluated sequentially, so one of the challenges
of this competition was a maximum set-up time of 30 min, which was extended to 45 min
during the competition. Because the evaluation setup has no temporal component, the
metric used for evaluation by the organizers is the third quartile of the Euclidean
distance to the track.
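
A metric of this kind can be sketched as follows. The polyline representation of the track, the 2D simplification, and the linear-interpolation quantile convention are assumptions of this sketch; the organizers' exact evaluation procedure is documented in the competition references.

```python
import math


def point_to_segment(p, a, b):
    """Euclidean distance from point p to the line segment a-b (2D)."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0:
        t = 0.0
    else:
        # Projection parameter, clamped so the foot point stays on the segment.
        t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / length_sq))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def distance_to_track(p, track):
    """Distance from p to the closest point of a polyline track (no timing needed)."""
    return min(point_to_segment(p, track[i], track[i + 1]) for i in range(len(track) - 1))


def quantile(values, q):
    """Quantile with linear interpolation between order statistics."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


track = [(0.0, 0.0), (10.0, 0.0)]
estimates = [(1.0, 0.1), (2.0, -0.2), (3.0, 0.3), (4.0, 0.4), (5.0, -0.5)]
errors = [distance_to_track(p, track) for p in estimates]
q75 = quantile(errors, 0.75)
```

Since each estimate is compared with the nearest point of the track rather than with a time-stamped reference position, errors along the direction of travel are not penalized by this metric.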

A bar chart of the results is shown in Figure 3.31. In addition to the Q(75 %) quantile 
used for ranking, the mean absolute Euclidean error and the Q(90 %) error are depicted. 
It should be noted that through the use of TDOA-based localization an additional
unknown error is introduced, which generally lowers the accuracy of these systems.
This has to be considered when comparing the ATLAS results and accuracy with the 
TWR-based system of the TPM team. 

It can be seen that the ATLAS approach, although utilizing TDOA instead of TWR, 
can provide accurate robot tracking results and is, therefore, usable in the context of 
real-time robotic movement tracking. 

To evaluate the localization approach in a more challenging environment, we 
participated in the fifth iteration of the Microsoft Indoor Localization Competition (MILC) 
co-located with the CPS-Week 2018 in Porto, Portugal. Previous competitions featured 
broad participation from science and industry, see [422, 423]. In this competition,
34 teams submitted abstracts, 26 systems officially registered, and 25 systems showed 
up in Porto. However, only 22 systems could provide data and were evaluated. 

The second category, which used custom infrastructure such as UWB, was required
to report 3D locations. Up to ten anchor nodes were allowed in the evaluation area. The
teams had a time slot of 8 hours to set up and calibrate their systems.

The teams were evaluated using a mobile laser scanner-based ground-truth system.
The organizers allocated a 15 min evaluation slot per team in a fixed order. However,
during the competition, the handling was more dynamic, so that teams with a
non-functioning system during their slot had the chance to debug their system and evaluate
it later. Furthermore, since the competition area was the main staircase for the attendees
of the four conferences held during the evaluation, the systems had to cope with the
obstruction and interference of visitors and observers.
Fig. 3.32: Reconstructed bar chart of the mean 3D accuracies from the MILC 2018 data after temporal
alignment and elimination of the 25 % largest errors for each team. Note the diversity of sensors for
the individual localization methods.


The organizers chose the mean Euclidean error as the metric for evaluation. This error,
however, is strongly influenced by the large outliers at the beginning of the evaluation
run. Furthermore, a temporal offset of around 1 s between the ground truth and the
ATLAS team's result trajectory is observed, which introduced additional errors
into the evaluation.

Note that with improved temporal alignment, the mean error of the ATLAS team
improves. Since it was desirable to compare the performance of the properly initialized
ATLAS results with that of the other teams, the mean of the best 75 % of the temporally
aligned results was evaluated. In order to provide a fair basis for comparison, the best
75 % of the results of all teams were considered. Therefore, the analysis does not merely
remove the outliers of one team but improves the results of all teams, as depicted in the
lowest bar chart.
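
The outlier-robust part of this re-evaluation amounts to a one-sided trimmed mean, sketched below. The function name and the rounding rule for the number of retained samples are assumptions; the temporal alignment step is not reproduced here.

```python
def mean_of_best_fraction(errors, fraction=0.75):
    """Mean of the smallest `fraction` of the errors (at least one sample is kept)."""
    ordered = sorted(errors)
    keep = max(1, int(len(ordered) * fraction))
    return sum(ordered[:keep]) / keep


# Hypothetical per-sample 3D errors in metres, with two large start-up outliers.
errors = [0.2, 0.3, 0.1, 0.4, 5.0, 0.2, 0.3, 9.0]
plain_mean = sum(errors) / len(errors)
trimmed_mean = mean_of_best_fraction(errors)
```

Applying the same fraction to every team keeps the comparison symmetric: each team loses its own worst quarter of samples, not only the team with the initialization outliers.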

As depicted in Figure 3.32, when considering the aforementioned points, the actual
performance of the ATLAS system is significantly better than indicated by the official
results. Even though the ATLAS system uses a TDOA-based approach, which has
significant downsides in accuracy, the proposed implementation yields results that are
comparable to, and often better than, those of the TWR-based approaches.


3.5.5 Conclusion 


This section presented the basic capabilities, challenges, and novel solutions for high 
precision wireless localization. Here, the bounds in channel utilization, energy con- 
sumption, and achievable accuracy are limiting factors for the feasibility of a wide 
range of applications. 

Therefore, novel approaches for maximizing information and localization through- 
put while minimizing channel utilization and power consumption and maintaining 
precise localization results were shown to overcome technology barriers and ultimately 
enable a connected cyber-physical world. The ATLAS FaST scheduling scheme and 
approaches for improving the achievable accuracy are highlighted. 

It could be shown that, based on these requirements, the solution approaches are
capable of improving both channel utilization and energy efficiency. Further, approaches
to increase the achievable accuracy in challenging environments can be successfully
applied to improve TDOA localization accuracy.

In future work, addressing challenges such as reducing the number of required anchors,
ad hoc configuration, and in-depth signal quality assessment with machine learning is
envisioned. The potential for high-precision localization enabled by the characteristics
of novel communication solutions such as new UWB standards along with 5G and
future 6G systems is huge.
3.6 Indoor Photovoltaic Energy Harvesting


Mojtaba Masoudinejad 


Abstract: Advancement in the field of electronics has enabled devices with ultra-low
power (ULP) demands. However, Industry 4.0 devices made of such components must
be powered over long operational periods, specifically in remote or hardly accessible
environments. Scavenging energy from the environment is a very common technique
to tackle this energy supply issue. Different energy harvesting principles are available
that convert energy from distinct forms into electric power. Here, we focus on
Photovoltaic (PV) energy harvesting because it is the most mature technique.


In addition to the size and weight limitations of Industry 4.0 and IoT devices,
which constrain the size of a PV cell, these devices are applied mostly in indoor
environments. Hence, the specific behavior of PV modules for indoor applications
under artificial lighting is analyzed here. Using a systematic data acquisition procedure,
typical PV models are adapted for the ULP harvesting environments. A normalization
procedure is introduced during parameter tuning because common techniques are not
applicable to PV modules operating in low and ultra-low lighting conditions.
Guidelines are provided to assure the numerical stability of parameter tuning of models.


The second layer of a two-fold model represents the relation of tuned curves with 
the environmental factors. Using a relative representation according to the highest 
light intensity, these models can be applied for different conditions. Performance of 
the overall model is evaluated on an extra dataset collected from a new environment 
showing model errors less than 6 % in the worst-case condition. 

Parts of this section are taken from [434] with the consent of the author. 


3.6.1 Introduction: Energy Harvesting 


Energy is available in nature in a wide range of forms, from heat and mechanical energy 
to the energy stored in electromagnetic waves and light photons. Any method which 
enables scavenging these energies can be called energy harvesting and a transducer 
that converts them into the desired form is an energy-harvesting device, or harvester 
for short. Energy harvesting has a long history, since windmills, which convert wind 
energy into mechanical energy to mill grains, date back to the 9th century. Nonetheless, 
modern energy harvesting converts energy into electricity. 

Wind, solar, Photovoltaic (PV), piezoelectric, thermal, radio frequency, and tidal
energy harvesting are only some examples of modern energy-harvesting techniques.
However, techniques based on the conversion of light (specifically solar energy) into
electricity are more mature than others due to long research and applications in diverse
fields. While the term solar is commonly known, it covers different principles for
converting sunlight into electricity. PV is one specific form, which uses semiconductor
materials and technologies to convert light into electric energy. In a simplified view,
a PV transducer can be considered to use the inverse principle of a Light Emitting
Diode (LED). In addition to the maturity of PV harvesting, its integration into ULP
Industry 4.0 hardware is the main reason we focus on it below.

Fig. 3.33: I-V curve of a Sanyo/AM-1464 PV module measured under fluorescent light. Axis and
curve labels: Vp [V], Vm, Voc, Im.


3.6.1.1 PV Energy Harvesting 

“A PV transducer is a semiconductor device generating electrical power when
illuminated with photons [434]”. These semiconductors have electrons in their valence
energy band, which are weakly bound. This bond can be broken by any photon that has
higher energy than the band gap, which moves the electron to the conduction band.
As long as enough photons illuminate the semiconductor surface, the photons' energy
is converted into a flow of electrons, turning light into electric energy. Due to the
nature of this conversion, PV generates Direct Current (DC). Consequently, a PV
harvester is an electric source, but not an ideal one. Depending on the operational
condition of the PV module, it can act as either a voltage or a current source.
The common I-V behavior of a PV module is shown in Figure 3.33.

As can be seen in Figure 3.33, for a large portion of the voltages (mainly in the lower
range) the PV module acts as a current source, while for most current values (in a small
voltage range) it acts as a voltage source. However, neither of these sections is ideal
because they are not purely horizontal or vertical lines. The bending point where the
behavior changes between the two source forms is critical because the maximum power
can be extracted from the module at this specific point. Hence, it is called the Maximum
Power Point (MPP), and techniques used to keep the operational point at this point are
called Maximum Power Point Tracking (MPPT). When Vp and Ip describe the harvested
voltage and current, respectively, and Pp = Vp · Ip is the harvested power, the MPP can
be found mathematically from Equation 3.37.

$$\frac{\partial P_p}{\partial V_p} = \frac{dI_p}{dV_p}\, V_p + I_p = 0, \quad \text{at: } V_p = V_m \tag{3.37}$$
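
On measured data, the MPP is usually located on the sampled curve rather than from the derivative. The following sketch picks the sample with the largest product of voltage and current; the sample values are hypothetical and only mimic the shape of an indoor I-V curve.

```python
def maximum_power_point(voltages, currents):
    """Return (Vm, Im, Pm) of the sample with the largest power V * I."""
    vm, im = max(zip(voltages, currents), key=lambda vi: vi[0] * vi[1])
    return vm, im, vm * im


# Hypothetical samples: nearly constant current at low voltage, steep drop near Voc.
voltages = [0.0, 1.0, 2.0, 3.0, 4.0]
currents = [50e-6, 49e-6, 47e-6, 40e-6, 0.0]

vm, im, pm = maximum_power_point(voltages, currents)
```

The resolution of such an estimate is limited by the voltage step of the measurement, so a finer sweep around the bending point of the curve improves the result.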
Fig. 3.34: Equivalent two-diode circuit model of a PV transducer including parasitic resistances.


In addition to the MPP, there are two other critical points for a PV module, namely
the intersections with the voltage and current axes: the open-circuit and short-circuit
points.


3.6.2 PV Transducer Model 


While some applications require only knowledge of these three specific points, most
utilizations of PV systems require a more descriptive explanation of the whole I-V curve.
Based on the Shockley diode equation [62] as an outcome of work by Hall [254] and
Shockley et al. [604], a multi-diode model of this behavior is explained as:

$$I_p = I_g - \sum_{*} I_{d*} - \frac{V_p + I_p R_s}{R_{sh}} \tag{3.38}$$

where Ig is the photo-generated current, and Rs and Rsh are the series and parallel
resistances, respectively. The diode current Id* is defined as in Equation 3.39, where
* is the diode number.

$$I_{d*} = I_{s*} \left[ \exp\left( \frac{V_p + I_p R_s}{n_{*}\, V_t} \right) - 1 \right] \tag{3.39}$$

In Equation 3.39, Is and n are the saturation current and ideality factor of each diode. In
addition, Vt describes the thermal voltage from Vt = B · T/q, where B is the Boltzmann
constant, T is the temperature in Kelvin, and q is the elementary charge.

From the formulation in Equation 3.38, an Equivalent Circuit Model (ECM) can be
made for the reproduction of the source behavior of a PV module. This ECM, shown in
Figure 3.34, is able to replicate characteristics of the PV cell, including its
non-idealities and resistances. Depending on the desired accuracy of the I-V curve
replication, it is possible to reduce the number of diodes to one. Some applications,
however, use a higher number of diodes to increase the degrees of freedom for a better
replication of the curve.

Regardless of the number of diodes in a PV transducer's model, this model has to
be fitted to the real curve of each module and light type. Consequently, the number
of parameters to tune differs according to the number of diodes. Moreover, a system
designer may even remove some parameters from the model in some applications.
Therefore, a general parameter vector ψ will be used to represent these unknowns.


3.6.2.1 Parameter Tuning 
There is a large body of research on methods for tuning the parameter vector ψ for each
PV module. However, they can be simply divided into numerical tuning and Algebraic
Equation Set (AES) tuning. In numerical methods, a set of points on the I-V curve is
measured and directly used to solve an optimization problem to minimize the error. For
the ideal case in which the whole I-V curve is measured, this can be formulated as an
error minimization, where ip denotes the current from the tuned model:

$$\min_{\psi}\ \varepsilon(\psi) = \int_{0}^{V_{oc}} \left\| I_p(V_p) - i_p(V_p, \psi) \right\| \, dV_p \tag{3.40}$$
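
In practice, only a finite set of measured points is available, so the integral in Equation 3.40 has to be discretized. The sketch below approximates it with the trapezoidal rule and uses the absolute deviation as the error measure, which is one possible choice and an assumption of this example. The `model` callable and the linear toy model are placeholders standing in for the actual PV model.

```python
def tuning_error(voltages, measured, model, params):
    """Trapezoidal approximation of the integrated absolute model error over voltage.

    voltages must be sorted in ascending order; model(v, params) returns the
    model current at voltage v for the candidate parameter set params.
    """
    residual = [abs(i - model(v, params)) for v, i in zip(voltages, measured)]
    total = 0.0
    for k in range(1, len(voltages)):
        total += 0.5 * (residual[k] + residual[k - 1]) * (voltages[k] - voltages[k - 1])
    return total


def toy_model(v, params):
    """Placeholder model (straight line), used only to exercise the error function."""
    offset, slope = params
    return offset - slope * v


voltages = [0.0, 1.0, 2.0, 3.0, 4.0]
measured = [toy_model(v, (1.0, 0.2)) for v in voltages]

exact_fit = tuning_error(voltages, measured, toy_model, (1.0, 0.2))
offset_fit = tuning_error(voltages, measured, toy_model, (1.1, 0.2))
```

An optimizer would call `tuning_error` repeatedly with different candidate parameter sets, which is why the cost of evaluating the model current at every voltage matters for the overall run time.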


While this method can be explained in a simple way, its application faces several
challenges. First, a large set of measurements from the PV system is necessary under
constant environmental conditions. It can be especially challenging to keep the light
intensity constant for the whole measurement duration. Other issues are related to the
computational aspects of this method. On the one hand, a fairly computationally
intensive optimization has to be solved to reach a set of reliable values for the
parameters in ψ. On the other hand, these methods are sensitive to the initial values
used for these parameters. Another issue that adds complexity is finding the current
value for each voltage in the model. As seen in the model's formulation, Equation 3.38
describes an implicit relation between voltage and current. Therefore, the calculation
of the current requires either an explicit relation or has to be done in a numerical
(iterative) way. For the single-diode model, Femia et al. [204] give an explicit relation
between the parameters using the Lambert W function as:


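$$I_p = \frac{R_{sh}\,(I_g + I_s) - V_p}{R_s + R_{sh}} - \frac{n\, V_t}{R_s}\, W(\theta_I) \tag{3.41}$$

where the Lambert W function is denoted by $W$, and $\theta_I$ is:

$$\theta_I = \frac{R_{sh}\, R_s\, I_s}{n\, V_t\, (R_s + R_{sh})} \exp\left( \frac{R_{sh}\, R_s\, (I_g + I_s) + R_{sh}\, V_p}{n\, V_t\, (R_{sh} + R_s)} \right) \tag{3.42}$$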
Although this simplifies finding the current in the single-diode model, the two-diode
model (which is more accurate) still requires a numerical solution for each
voltage value. Therefore, there is a numerical iterative function solving inside the
optimization in Equation 3.40, which is also solved in a numerical iterative sense itself.
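
The inner iterative step can be sketched with a bracketing root search on the implicit relation of Equation 3.38. Bisection is chosen here for robustness and is not necessarily the solver used by the author. The bracket relies on the residual being strictly decreasing in the current and assumes a non-negative voltage and a positive series resistance; all parameter values in the usage example are hypothetical.

```python
import math

BOLTZMANN = 1.380649e-23             # J/K
ELEMENTARY_CHARGE = 1.602176634e-19  # C


def thermal_voltage(temperature_k):
    """Thermal voltage Vt = B * T / q in volts."""
    return BOLTZMANN * temperature_k / ELEMENTARY_CHARGE


def residual(i, v, ig, diodes, rs, rsh, vt):
    """Right-hand side minus left-hand side of the implicit model at (v, i).

    diodes is a list of (saturation current, ideality factor) pairs, so the
    same function covers the single-diode and the two-diode model.
    """
    vd = v + i * rs
    i_diodes = sum(i_s * (math.exp(vd / (n * vt)) - 1.0) for i_s, n in diodes)
    return ig - i_diodes - vd / rsh - i


def solve_current(v, ig, diodes, rs, rsh, vt, iterations=200):
    """Solve the implicit model for the current at voltage v >= 0 by bisection.

    At i = -v / rs the residual is positive and at i = ig it is non-positive,
    so the unique root lies inside this bracket (requires rs > 0).
    """
    lo, hi = -v / rs, ig
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if residual(mid, v, ig, diodes, rs, rsh, vt) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Hypothetical two-diode parameters for a small indoor module.
params = dict(ig=60e-6, diodes=[(1e-9, 8.0), (1e-8, 16.0)],
              rs=100.0, rsh=1e6, vt=thermal_voltage(300.0))
i_short_circuit = solve_current(0.0, **params)
```

Calling `solve_current` for every measured voltage inside an outer parameter search reproduces the nested iteration described above, which explains the computational cost of the numerical tuning approach.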

The computational complexity of the numerical method is the main reason why
most researchers look for alternative routes for tuning the parameters in ψ. Using
physical knowledge about the model is a common way to build
an AES that can be solved with less computation. From the general I-V relation and
basic knowledge about a PV system, several equations can be formulated:


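$$\text{at SC: } V_p = 0 \;\Rightarrow\; I_{sc} = I_g - I_s \left[ \exp\left( \frac{I_{sc} R_s}{n\, V_t} \right) - 1 \right] - \frac{I_{sc} R_s}{R_{sh}} \tag{3.43a}$$

$$\text{at OC: } I_p = 0 \;\Rightarrow\; 0 = I_g - I_s \left[ \exp\left( \frac{V_{oc}}{n\, V_t} \right) - 1 \right] - \frac{V_{oc}}{R_{sh}} \tag{3.43b}$$

$$\text{at MPP: } I_m = I_g - I_s \left[ \exp\left( \frac{V_m + I_m R_s}{n\, V_t} \right) - 1 \right] - \frac{V_m + I_m R_s}{R_{sh}} \tag{3.43c}$$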
It has to be noted that these equations are for a single-diode model but can be expanded
for the double-diode model as well. While the open-circuit and short-circuit points can
be easily measured, finding the MPP without any prior knowledge is not possible. However,
it is known that the MPP voltage and current are approximately proportional to the
open-circuit voltage (Voc) and short-circuit current (Isc), respectively. Hence, it is
possible to simply measure the parameters at these specific proportional values and use
them in Equation 3.43c, though this introduces a marginal error. The above equations are
only a mathematical representation of exact points on the curve, which is somewhat
similar to the numerical method. Hence, using the limited knowledge from this AES can
provide only a minimal fit of the model at these specific points. Furthermore, even a
single-diode model has a parameter vector ψ with 5 values. Consequently, this AES is
under-determined and requires either more equations or a reduction of some parameters.
Although a few researchers have experimented with models with only 3 or 4 parameters,
it is a well-established principle to use the derivative of the model at keypoints to
expand the AES. These equations for a single-diode model are:


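$$
\left(\frac{dI_p}{dV_p}\right)_{SC} = -\frac{I_s}{n \cdot V_t} \left(1 + \left(\frac{dI_p}{dV_p}\right)_{SC} \cdot R_s\right) \exp\left(\frac{I_{sc} \cdot R_s}{n \cdot V_t}\right) - \frac{1}{R_{sh}} \left(1 + \left(\frac{dI_p}{dV_p}\right)_{SC} \cdot R_s\right) \tag{3.44}
$$

$$
\left(\frac{dI_p}{dV_p}\right)_{OC} = -\frac{I_s}{n \cdot V_t} \left(1 + \left(\frac{dI_p}{dV_p}\right)_{OC} \cdot R_s\right) \exp\left(\frac{V_{oc}}{n \cdot V_t}\right) - \frac{1}{R_{sh}} \left(1 + \left(\frac{dI_p}{dV_p}\right)_{OC} \cdot R_s\right) \tag{3.45}
$$

$$
\left(\frac{dI_p}{dV_p}\right)_{MPP} = -\frac{I_s}{n \cdot V_t} \left(1 + \left(\frac{dI_p}{dV_p}\right)_{MPP} \cdot R_s\right) \exp\left(\frac{V_m + I_m \cdot R_s}{n \cdot V_t}\right) - \frac{1}{R_{sh}} \left(1 + \left(\frac{dI_p}{dV_p}\right)_{MPP} \cdot R_s\right) \tag{3.46}
$$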


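Furthermore, the derivative of the power with respect to the voltage is zero at the
MPP, which leads to Equation 3.47.

$$
\left(\frac{\partial P_p}{\partial V_p}\right)_{MPP} = V_m \cdot \left(\frac{dI_p}{dV_p}\right)_{MPP} + I_m = 0 \;\Rightarrow\; \left(\frac{dI_p}{dV_p}\right)_{MPP} = -\frac{I_m}{V_m} \tag{3.47}
$$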


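Substituting this in Equation 3.46 adds Equation 3.48 as an additional equation to the
AES.

$$
-\frac{I_m}{V_m} = -\frac{I_s}{n \cdot V_t} \left(1 - \frac{I_m}{V_m} \cdot R_s\right) \exp\left(\frac{V_m + I_m \cdot R_s}{n \cdot V_t}\right) - \frac{1}{R_{sh}} \left(1 - \frac{I_m}{V_m} \cdot R_s\right) \tag{3.48}
$$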


From all these equations, an AES with 7 explicit equations can be built.
Nevertheless, in addition to the difficulty of measuring the MPP, the challenging task
of finding the curve’s slope at the keypoints is mandatory. Some simplified solutions
suggest calculating the slope from the slope of a line between keypoints, though this
adds error to the tuning performance. Therefore, further measurements are necessary to
improve the calculation of the slopes. Moreover, these equations fit the model’s
formula to the curve only at the measured keypoints and their derivatives and cannot
provide high accuracy at other points. However, it is possible to measure any further
point on the curve and add it as a simple equation to the AES. As can be seen, both
parameter tuning approaches have their advantages and disadvantages.
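
A simple way to obtain the slopes needed by the AES can be sketched as follows. This is
only an illustration under assumptions that are not part of the original text: the
curve is given as a list of measured (V, I) points sorted by voltage, the slopes at the
short-circuit and open-circuit points are approximated by the line through the two
nearest measured points, and the MPP is taken as the measured point with maximum power.
The function names are hypothetical.

```python
def secant_slope(p0, p1):
    """Slope dI/dV of the straight line through two (V, I) points."""
    (v0, i0), (v1, i1) = p0, p1
    if v1 == v0:
        raise ValueError("points must differ in voltage")
    return (i1 - i0) / (v1 - v0)


def keypoint_slopes(curve):
    """Estimate dI/dV at SC, OC and MPP from a measured I-V curve.

    curve: list of (V, I) tuples sorted by increasing voltage, starting at the
    short-circuit point (V = 0) and ending at the open-circuit point (I = 0).
    """
    if len(curve) < 3:
        raise ValueError("need at least three measured points")
    # Secant approximations from the two nearest measured points (assumption).
    sc_slope = secant_slope(curve[0], curve[1])
    oc_slope = secant_slope(curve[-2], curve[-1])
    # MPP taken as the measured point with maximum power V * I (assumption).
    v_m, i_m = max(curve, key=lambda point: point[0] * point[1])
    # Equation 3.47: the power derivative vanishes at the MPP.
    mpp_slope = -i_m / v_m
    return {"SC": sc_slope, "OC": oc_slope, "MPP": mpp_slope}
```

The quality of the secant estimates depends directly on how close the neighboring
measurements lie to the keypoints, which mirrors the sensitivity of the AES-based tuning
to the slope calculation.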


3.6.2.2 Environmental Factors 

An I-V curve represents a PV module in a specific environmental condition defined by
the light intensity (E) and temperature (T). While the overall form of a curve remains,
its details, including keypoints and slopes, shift with these factors. Therefore, a
further modeling level is required to explain the effect of environmental factors. Some
initial models explain the changes in keypoint values caused by deviations in the
environmental factors. Unfortunately, these models are not consistent in the
literature, and different formulations exist for a single parameter. However, their
methodology is similar because they all originate from PV operation in solar light. In
these models, environmental factors are described relative to a reference condition,
which is mostly the AM1.5 condition. It corresponds to one sun at 1000 W/m² with light
incident perpendicular to the PV cell at 25 °C.

Only a few relations are available in the literature that express some of the
parameters in p as functions of the environmental factors. Moreover, most of them rely
on simplifying assumptions and mostly describe a specific application case study.
Masoudinejad [434] reviews the state-of-the-art formulation for these parameters.


3.6.2.3 Indoor PV Energy Harvesting 

Despite the maturity of PV energy harvesting and the availability of diverse techniques
for analysis and modeling, most of the available methods are based on applications
under sunlight. By contrast, modern use cases, especially within the IoT and
Industry 4.0 realm, are in indoor areas with artificial lighting. Consequently, the
available methodologies have to be revised and their solutions validated. One of the
major differences between these fields is the scale of the light intensity. Figure 3.35
provides an overview. As can be seen, the light intensity in some industrial
applications, such as the PhyNetLab [197] industrial warehouse, is several orders of
magnitude lower than in solar-based applications. In addition to the light-intensity
level, the indoor light spectrum takes many more forms and can differ according to the
light source, building materials, and surrounding environment. This difference in the
spectrum between solar light and typical indoor light sources can be seen in
Figure 3.36.
Fig. 3.35: Light intensity range in some common conditions. Reproduced from [146].
Fig. 3.36: Left: outdoor solar light spectrum [479] and sun’s black body radiation at 5700 K [703].
Right: measured indoor light spectrum of three different artificial lighting.


Differences between light sources directly affect the I-V curve of a PV module. Figure 3.37
provides an example. Although both light sources are from the same manufacturer
with the same specification and power, they produce dissimilar curves.

The only difference between these two sources is their color temperature, which
can be seen in their spectrum in Figure 3.38.

As can be seen, although the integrative light intensity of both sources is equal at
248 lx, their difference in color temperature reflects a discrepancy in the shape of
the spectrum. Due to the non-uniform sensitivity of PV modules to each wavelength,
the resulting I-V curve differs for each specific condition. Consequently, an analysis
and modeling of the PV behavior under artificial lighting is not only necessary but
also requires careful consideration.
Fig. 3.37: I-V curve of a Solems PV module measured under cold and warm LED light. Both sources
are from the same manufacturer, measured at E = 248 lx and T = 299 K.
Fig. 3.38: Light spectrum measured under cold and warm white LED light, both at 248 lx. In spite of
equal integrative value, the spectrum is different. Left: irradiance, Right: illuminance.


3.6.3 Indoor PV Modeling 


The diversity of indoor lighting types and the lack of reliable data on PV behavior
under indoor artificial lighting demand the collection of representative datasets.
These data can later be used to analyze the I-V curve for such lighting, to extract
parameters through curve tuning, and to formalize the relation of these parameters to
the environmental factors.

The required information for each measurement of the PV in an indoor area includes
the I-V curve and light-intensity information in addition to the temperature. To
measure the I-V curve, a variable impedance has to be connected to the PV cell.
Starting from a very large value (close to the open-circuit condition), this impedance
is decreased until the short-circuit condition is reached. During this impedance
sweep, voltage and current have to be measured simultaneously the whole time. These
measurements can be used to reproduce the curve. Such a procedure can be carried out
with a Source Measurement Unit (SMU).
Fig. 3.39: Examples of measured I-V curves. Left: in normal space. Right: in PVNS.


There are multiple systems and devices for light intensity measurement. To obtain
combined information, both integrative measurement and spectrometry are used here.
Moreover, a closed environment with controlled light intensity is used to ensure
reliable and reproducible lighting. Using a system described in detail in [432, 433,
435], different datasets were collected, which are publicly accessible from [433].

These datasets will be used below to develop the models and explore indoor PV 
behavior. However, there are two preliminary topics that need to be discussed before 
modeling. 


3.6.3.1 PV Normalized Space 

A look at the available indoor datasets, such as the example in Figure 3.39, reveals
that the signal ranges span different orders of magnitude. This diversity can cause
calculation errors and add complexity to the numerical calculations.

Therefore, the PV Normalized Space (PVNS) is introduced to scale all curves into a
similar range and avoid numerical problems. This conversion is applied simply by
dividing the voltage and current of each curve by their respective maxima, referred to
as Voc and Isc. Note that the resistances have to be scaled as well, whereas the
parameter n is dimensionless and does not require any scaling. All in all, the
relation for all scaled parameters (marked here with a tilde, with $\odot$ denoting
element-wise multiplication) is:

$$
\tilde{P} = P \odot \left[ \frac{1}{V_{oc}}, \frac{1}{V_{oc}}, \frac{I_{sc}}{V_{oc}}, \frac{I_{sc}}{V_{oc}}, \frac{1}{I_{sc}}, \frac{1}{I_{sc}}, \frac{1}{I_{sc}} \right] \tag{3.49}
$$

when:

$$
P = \left[ V_p, V_t, R_s, R_{sh}, I_p, I_g, I_s \right] \tag{3.50}
$$

The effect of this scaling on the I-V curve can be seen in Figure 3.39 on the right. This
scaling highlights the differences in the form of the I-V curve as a consequence of
different light ranges, which are not easily detectable on the original curve. Moreover,
it brings all available data into a uniform scale, which helps to reduce the sensitivity
to the signal values in numerical methods.

Tab. 3.7: Bound suggestions for the PV model’s parameters in the PVNS.

Parameter   Lower limit         Upper limit         Unit
Ig          0.9                 1.1                 [A]
Rs          ε                   (Voc − Vm) / Im     [Ω]
Rsh         Vm / (Isc − Im)     ∞                   [Ω]
Is          ε                   1.1                 [A]
n           1                   10                  [-]
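
The PVNS scaling described above can be sketched in a few lines of Python. This is a
minimal illustration rather than the implementation used for the datasets: the function
name `to_pvns` and the dictionary layout are hypothetical, and Voc and Isc are simply
taken as the maximum measured voltage and current of the curve, as described in the
text.

```python
def to_pvns(voltages, currents, r_s=None, r_sh=None):
    """Scale a measured I-V curve (and optional resistances) into the PVNS.

    Voltages are divided by Voc, currents by Isc, and resistances are
    multiplied by Isc / Voc. The ideality factor n needs no scaling.
    """
    v_oc = max(voltages)  # open-circuit voltage: maximum of the curve
    i_sc = max(currents)  # short-circuit current: maximum of the curve
    if v_oc <= 0 or i_sc <= 0:
        raise ValueError("curve must contain positive voltage and current")
    scaled = {
        "v": [v / v_oc for v in voltages],
        "i": [i / i_sc for i in currents],
        "v_oc": v_oc,
        "i_sc": i_sc,
    }
    r_scale = i_sc / v_oc
    if r_s is not None:
        scaled["r_s"] = r_s * r_scale
    if r_sh is not None:
        scaled["r_sh"] = r_sh * r_scale
    return scaled
```

Keeping `v_oc` and `i_sc` in the result allows tuned parameters to be converted back to
physical units after the fit.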


3.6.3.2 Evaluation Criteria 

As for the modeling procedures, an evaluation factor is required to quantify the
performance of a model. When replication of the I-V curve is desired, relative
performance factors are preferred due to the large signal range. Nonetheless, the
division by zero at the open-circuit point prevents the calculation of a relative
factor unless this point is removed. Unfortunately, the open-circuit point is a
keypoint and plays a critical role in any model. Consequently, the Mean Absolute
Normalized Error (MANE) is defined here as in Equation 3.51. It normalizes the absolute
error by Isc, expresses it as a percentage, and takes the mean value. It can equally be
interpreted as the Mean Absolute Error of the I-V curve in the PVNS.


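$$
\mathrm{MANE} = \frac{100}{I_{sc} \cdot V_{oc}} \int_{0}^{V_{oc}} \left| \varepsilon(I_p) \right| \, dV_p = 100 \cdot \int_{0}^{1} \left| \varepsilon\left(\frac{I_p}{I_{sc}}\right) \right| \, d\left(\frac{V_p}{V_{oc}}\right) \tag{3.51}
$$

or for the case of discrete measured data with m points:

$$
\mathrm{MANE} = \frac{100}{m \cdot I_{sc}} \sum_{i=1}^{m} \left| \varepsilon(I_i) \right| \tag{3.52}
$$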
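
For discrete data, the MANE can be computed as in the following sketch. It assumes that
the error is the difference between the measured and the modeled current at the same
(equidistant) voltage points and that Isc is the maximum measured current unless given
explicitly; the function name `mane` is hypothetical.

```python
def mane(measured, modeled, i_sc=None):
    """Mean Absolute Normalized Error in percent for one I-V curve.

    measured, modeled: currents sampled at the same voltage points.
    i_sc: short-circuit current used for normalization (defaults to the
    maximum measured current, an assumption of this sketch).
    """
    if len(measured) != len(modeled) or not measured:
        raise ValueError("need two non-empty current lists of equal length")
    if i_sc is None:
        i_sc = max(measured)
    if i_sc <= 0:
        raise ValueError("short-circuit current must be positive")
    total = sum(abs(a - b) for a, b in zip(measured, modeled))
    return 100.0 * total / (len(measured) * i_sc)
```

Because the error is normalized by Isc and not by the local current, the open-circuit
point (where the current is zero) can stay in the evaluation.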


3.6.3.3 IV Curve Parameter Tuning 

In the first step of the modeling, the p parameters have to be tuned for each I-V
curve. For this purpose, the SWL, SCL, and IWL datasets from [433] are used. For the
numerical tuning method, as discussed in Section 3.6.2.1, the initial value used for
each parameter plays a critical role. After a large set of experiments, a general
guideline for the bounds on these parameters in the PVNS can be provided as in
Table 3.7. Using these bounds, the parameters for all curves in the datasets are tuned
using the least-squares method. This process is repeated for both the single- and
double-diode models using 200 equidistant points along the voltage axis for each curve.
The distribution of the error for all cases is presented in Figure 3.40.

Fig. 3.40: Density of MANE distribution for all datasets using the single-diode model (left) and
double-diode model (right).

As could be expected, the double-diode model performs much better than its single-diode
counterpart. This can be attributed simply to its extra tuning parameters.
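
The bound suggestions for the tuning can be collected in a small helper, as sketched
below. The sketch assumes that Table 3.7 reads as follows: Ig in [0.9, 1.1], Rs between
a small positive value and (Voc − Vm)/Im, Rsh between Vm/(Isc − Im) and infinity, Is
between a small positive value and 1.1, and n in [1, 10]. The value chosen for the small
positive limit (`EPS`) and the function names are assumptions made only for this
illustration.

```python
import math

EPS = 1e-12  # assumed stand-in for the small positive lower limit


def pvns_bounds(v_m, i_m, v_oc=1.0, i_sc=1.0):
    """Parameter bounds in the PVNS derived from the MPP (v_m, i_m).

    In the PVNS, v_oc and i_sc are both 1 by construction.
    """
    if not (0.0 < v_m < v_oc and 0.0 < i_m < i_sc):
        raise ValueError("MPP must lie strictly inside the curve range")
    return {
        "i_g": (0.9, 1.1),
        "r_s": (EPS, (v_oc - v_m) / i_m),
        "r_sh": (v_m / (i_sc - i_m), math.inf),
        "i_s": (EPS, 1.1),
        "n": (1.0, 10.0),
    }


def clip_to_bounds(params, bounds):
    """Clip an initial guess so that every parameter starts inside its bounds."""
    return {
        name: min(max(value, bounds[name][0]), bounds[name][1])
        for name, value in params.items()
    }
```

Such a dictionary can be handed to any bounded least-squares routine; clipping the
initial guess first avoids starting the solver outside the feasible region.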

Applying the AES-based method to these datasets is computationally simpler, but it is
very sensitive to the way the slope of the curve is calculated. Moreover, compared with
the numerical method, AES-based tuning performs worse because the numerical method uses
far more points of the curve. Hence, the parameters tuned with the numerical method
continue to be used. Their distribution over the light intensity for the single- and
double-diode models can be seen in Figure 3.41 and Figure 3.42, respectively.
Fig. 3.41: Changes in single-diode model parameters according to the light intensity for all datasets.
Color of SWL points shows temperature in K as in the color bar.
Fig. 3.42: Changes in double-diode model parameters according to the light intensity for all datasets.
Color of SWL points shows temperature in K as in the color bar.
As can be seen from these figures, the single-diode model parameters have a clear
relation to the light intensity and temperature. Although this also holds for some
parameters of the double-diode model, others, such as the resistances and saturation
currents, behave irregularly. This hints at an over-dimensioning of the model, in which
the second diode introduces parameters that are not actually part of the physical
system. Therefore, a hypothesis can be made that a second diode (or at least some of
its parameters) is not part of the physical model of the PV system under artificial
lighting. Hence, it improves the performance of the tuning but is not physically
significant at the higher level. This behavior is somewhat similar to the principle of
over-fitting in machine learning, where the model focuses too much on the data, so that
the overall form gets lost. Nevertheless, this hypothesis cannot be proven at this
stage (using the available data) without further physical insight. Consequently, the
single-diode model is used for the remaining part of the modeling of the indoor
PV system.


3.6.3.4 Modeling Effect of Environmental Factors 

After tuning all parameters for the single-diode model, the next step is to formulate
the relation of each parameter to the environmental factors. However, since this is a
purely data-based modeling approach, the resulting models are empirical and do not
guarantee any physical insight. Yet, by building these models for all datasets, which
include different PV technologies and light sources, the generality of the models can
be preserved to some extent. After testing diverse function types on the available
data, these relations are formulated as:


en i z 
Ig = a =Ag1+E +g: AT (3.53a) 
g 
R; _ Rs _ Asi + Asay * AT (3.53b) 
Rs s2 + s3: E 
— Qn1 + Ang AT ~ 
TE Rsh apitaye AL i (ap -E+ aps) (3.53c) 
Ron apr +E 
ae er ec ee ag AT (3.53d) 
I; ai , B(autais-T) 
2 on ~ 1 
N= — =Qn1+Qn2:E+ — + Ano AT (3.53e) 


= 


In these equations, a denotes a tuning factor for each dataset, while the accented
symbols represent relative parameters. Each relative parameter is expressed with
respect to a reference condition, similar to AM1.5 in the solar case. However, contrary
to the solar case with its unique reference point, no constant condition can be defined
for all indoor environments. Hence, the point with maximum light intensity in each
dataset is used as the reference point, and all its parameters are used as the base
condition. Unlike the parameters, whose relative values are found from a ratio, the
temperature is better represented by a difference in these formulations, as used in
Equation 3.53.
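
The choice of the reference point and the conversion to relative values can be sketched
as follows. The data layout (a list of dictionaries with the keys `E`, `T`, and
`params`) and the function name are hypothetical, and treating the relative light
intensity as a ratio to the reference intensity is an assumption of this sketch.

```python
def to_relative(samples):
    """Express tuned parameters relative to the sample with maximum light intensity.

    samples: list of dicts with keys "E" (light intensity), "T" (temperature)
    and "params" (dict mapping parameter name to tuned value).
    Returns the reference sample and the list of relative samples.
    """
    if not samples:
        raise ValueError("need at least one sample")
    ref = max(samples, key=lambda s: s["E"])
    relative = []
    for s in samples:
        relative.append({
            # Light intensity and parameters as ratios to the reference.
            "E": s["E"] / ref["E"],
            "params": {k: v / ref["params"][k] for k, v in s["params"].items()},
            # Temperature as a difference to the reference.
            "dT": s["T"] - ref["T"],
        })
    return ref, relative
```

With this convention, the reference sample itself maps to a relative intensity of 1,
relative parameters of 1, and a temperature difference of 0.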


Using these formulations, a is tuned for each dataset. As an example, the output
parameter for each light intensity in the SWL dataset is presented in Figure 3.43.
These empirical models clearly replicate the behavior of each parameter with good
accuracy.
Fig. 3.43: Performance of models for parameters at different environmental conditions on the SWL
dataset including temperature effect.


Note that the parameters of the I-V curve with maximum light intensity are
used here as the reference point. Also, it is advisable to remove the temperature factor
for a dataset when the temperature deviation is minimal and can be ignored. In this
way, the model can be simplified without reducing its performance.


3.6.4 Indoor PV Model Evaluation 


So far, three available datasets have been used for the parameter tuning and
environmental factor modeling. Since the overall behavior of the models has been
checked on the same datasets, this cannot guarantee reliable performance under other
conditions. Therefore, an extra evaluation on a new dataset is beneficial.
Consequently, a new set of data has been collected in the PhyNetLab, including 120
samples collected at different positions, heights, and temperatures on different days.
This set includes light intensities between 244 lx and 494 lx and temperatures in the
range of 298 K to 302 K.

To avoid evaluating the model on the same data used for tuning the a factors, 30 % of
the samples are selected randomly, uniformly distributed over the light intensity. The
parameters for this smaller set are extracted and used in a least-squares method to
tune a in Equation 3.53. As for the model itself, the highest light intensity in this
subset is used as the reference condition. The resulting formulations are used to find
the p parameters for the remaining 70 % of the data. Using these parameters, the
resulting I-V curve is compared with the measured curve to find the MANE presented in
Figure 3.44.
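
One possible reading of this selection step is sketched below. It is an assumption of
this illustration, not the documented procedure: the samples are sorted by light
intensity, cut into as many consecutive groups as tuning samples are needed, and one
sample is drawn at random from each group, so that the tuning subset covers the whole
intensity range. The function name and the `(E, payload)` tuple layout are hypothetical.

```python
import random


def split_by_intensity(samples, fraction=0.3, seed=0):
    """Split samples into a tuning subset and an evaluation subset.

    samples: list of (E, payload) tuples, E being the light intensity.
    fraction: share of samples used for tuning.
    Returns (tuning, evaluation); together they contain every sample once.
    """
    if not samples:
        raise ValueError("need at least one sample")
    rng = random.Random(seed)
    ordered = sorted(samples, key=lambda s: s[0])
    n_tune = min(len(ordered), max(1, round(fraction * len(ordered))))
    tuning, evaluation = [], []
    for k in range(n_tune):
        # Consecutive, non-empty groups along the sorted intensity axis.
        lo = k * len(ordered) // n_tune
        hi = (k + 1) * len(ordered) // n_tune
        group = ordered[lo:hi]
        pick = rng.randrange(len(group))
        for j, sample in enumerate(group):
            (tuning if j == pick else evaluation).append(sample)
    return tuning, evaluation
```

For 120 samples and a fraction of 0.3, this yields 36 tuning samples and 84 evaluation
samples. Repeating the split with different seeds corresponds to repeating the random
subset selection.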
Fig. 3.44: MANE of model on the evaluation dataset. Only 30 % of data (randomly selected, shown
with circles) was used for parameter tuning.


As can be seen, all errors are in a very small range, which shows reliable performance
of the empirical abstract-level model in a new indoor environment. When the random
subset selection is repeated, the worst-case MANE is always less than 6 %. Hence, the
overall modeling principle can be accepted and applied to other environments.
3.6.5 Conclusion


This contribution has discussed the basic principles of PV energy harvesting and
its modeling. After a short review of available PV models, we introduced the tuning and
adaptation of a model for a PV transducer. We then discussed the differences between PV
behavior under solar light and under artificial sources (used in indoor environments).
Next, we established the need for model development and evaluation specifically for
indoor PV harvesting.

It was shown that each light source and indoor environment has its particular
characteristics due to differences in the light spectrum seen by a PV module. Therefore,
a new setup was presented to ensure reliable, highly accurate, and reproducible indoor
PV behavior data. Using these data, a PV normalized space was introduced to enable
a unified modeling strategy, regardless of large differences in the signals within the
indoor environment. Furthermore, the mean absolute normalized error was introduced
as a non-compromised evaluation factor for the model performance on each I-V curve.

Using the collected data, guidelines were provided for the bounds of the tuning
parameters in I-V curve fitting. While parameters were tuned for single- and
double-diode models of a PV system, it was shown that some parameters of the
double-diode model lack a specific relation to the environmental factors. This led to
the hypothesis that the double-diode PV model includes more parameters than the real
physical system has factors. These parameters therefore improve the fit, but at the
price of losing physical meaning. Hence, single-diode model data were used to develop
empirical models for each parameter according to the environmental factors. These
models use relative values with respect to a reference point, which is simply selected
as the data point with the highest light intensity.

Finally, these models were tested using all available datasets, and a new set of
data including 120 samples was collected in a real-case scenario. The environmental
condition of this new scenario was tuned into the model using only 30 % of the data
points, selected randomly. The application of this tuned model showed promising
performance, with a MANE of less than 6 % in the worst-case scenario. We thereby showed
the adaptability of the developed models to new real-world environments and their good
performance in predicting the I-V curve of a PV module in such environments.
Abstract: This contribution describes a testbed for the development of resource-
constrained micro-UAV (Unmanned Aerial Vehicle) swarms. The testbed architecture
combines external observation, visualization of logistic scenarios, and simulation 
systems with a drone control unit. Swarm algorithms are implemented on the drones 
to control their movements and enable their cooperation. A drone learns to perform a 
warehousing task using reinforcement learning. In combination with swarm algorithms, 
this behavior is extendable to a drone swarm. This work describes how drones can 
be deployed to solve tasks in industrial settings. In addition, an automatic charging 
station extends the runtime of the swarm. 


3.7.1 Introduction 


Drones are more formally known as unmanned aerial vehicles (UAVs) or unmanned 
aircraft systems. They are battery-powered devices that can be as large as an aircraft
or as small as the palm of a hand. A drone is an intelligent flying system with sensors 
and actuators that can be remotely controlled or fly autonomously. Due to its high 
adaptability, the use of drones is increasing across many civil application domains. Over 
the last decade, extensive research has been performed on consumer and commercial 
drones. Due to the advances in microelectronics, sensors are steadily getting lighter, 
smaller, more economical, smarter, and more accurate. As a result, drones are getting 
smaller, more energy-efficient, and easier to operate [521]. Moreover, there is a growing 
need to build small and resource-constrained drones that fit perfectly into the Industry 
4.0 setting, where all intelligent devices are networked and can exchange their data 
with each other [52]. 


3.7.2 Drones in Logistics 


Drones have shown high potential in the logistics industry. Some have projected that 
this market will grow by $29 billion by 2027 with an annual growth rate of almost 20 % 
[721]. A 2018 study found that electric drone delivery was more efficient than trucks, 
vans, passenger cars, and gasoline drones [650]. Electric drones are environmentally 
friendlier than other aerial vehicles when comparing CO2 emissions. Drones are a
potential substitute for traditional delivery methods, such as door-to-door or last-mile 
delivery using trucks. The difference is that they do not consume fuel and do not 
need infrastructure such as streets. However, drones still pose several safety hazards, 
limiting their use in industries and outdoors. Some common safety hazards include 
drones colliding with people, structures on the ground, and other drones [720]. Thus, 
more research is required to make drones safe to use in urban and industrial settings. 
Deploying drones for transporting goods to end users may transform the existing 
transportation methods in large cities and rural areas. They might not completely 
replace the traditional methods but will significantly impact this domain [521]. 
Warehouse management is an essential operation for most business activities
nowadays. Manual inventory has been the only option for a long time, but it poses
several challenges in terms of costs, inaccuracies, and safety [151]. Furthermore, in
the EU, warehousing and storage represent up to 15 % of the current costs in logistics
[193]. Thus, there is a growing need to automate warehouse operations while ensuring
the flexibility and adaptability of the warehouse. Automated storage solutions built
from mechanical conveyors, such as high-bay warehouses and automated small-parts
storage, cannot achieve this. However, drones could potentially cater to this
ever-increasing demand. There has been extensive research on drone use in transporting
goods outdoors, and the resulting solutions are commercially available [418, 642].
However, large-sized drones cannot operate in small, constricted spaces such as
warehouses. The key reasons are that GPS-based navigation systems are unavailable,
and the tolerances regarding collision avoidance and the time available for
decision-making are drastically lower. By contrast, small-sized drones can operate in
the interiors of warehouses and in the narrow and high rows of shelves. Thus, one can
integrate small-sized drones into the supply chain to automate various intra-logistics
operations [525].
Research on deploying autonomous drones and robots for human-machine interaction
in warehousing is still in its early stages. However, industries have pioneered
drones as extensions to their IoT environments or as complements to other
data-gathering processes. Most IoT devices have a limited battery life and are
stationary, and thus can only collect the data of a specific location for a limited
time. By contrast, resource-constrained small drones can gather data from dangerous
areas and from other fixed IoT devices. Subsequently, they can transmit all the
collected data to a central station. Drones can also aid in recharging IoT devices
wirelessly or remotely. Thus, drones can be essential in connecting IoT devices to the
whole IoT ecosystem [21, 26, 417]. This work introduces a testbed to replicate and
experiment with any industrial scenario involving small-sized resource-constrained
drones working alongside humans. In addition, an automatic charging station is
proposed to ensure the continuous operation of the drones.
3.7.3 Application


A possible logistics application for resource-constrained drone swarms, especially 
those that can recharge independently, is their use as a mobile surveillance system. 
For instance, this drone swarm can be used in warehouses to protect the stored goods 
from theft. The drones can then perch like birds on the existing trusses, high racks, and 
steel beams and observe their surroundings. From their high positions, they can gather
a larger amount of data than low-power stationary IoT devices equipped with a
battery. A charging station enables the continuous operation of the drone swarm. If 
a drone is low on battery, it automatically flies into the charging station, and another 
fully charged drone takes over its tasks. 


3.7.4 Swarm Algorithms 


Swarm algorithms create a group dynamic similar to the swarm behavior of animals, 
such as a flock of birds or a school of fish. Craig Reynolds developed a simulated swarm 
behavior in 1987 with the help of these observations [526]. In logistics systems, a swarm 
algorithm is similar to traffic rules, which enable a smooth traffic flow by observing 
the local environment with as little communication as possible between the vehicles. 
One advantage of using swarm algorithms in logistics is that they do not require fixed 
routes but can use all available space. This makes it easy to change the layout
and to react to unpredictable events or disruptions. Furthermore, due to the modular
design of swarm algorithms, new behaviors can be added without altering the existing 
ones. The paradigm of swarm behavior aims to enable several independent robots to 
collaborate towards achieving a collective goal, acting as a swarm. 

Swarm behavior, established by Reynolds, is implemented by three basic rules: 
the cohesion rule, the alignment rule, and the separation rule [153]. Figure 3.45 shows 
the rules schematically. The active agent in each case is the highlighted black triangle 
with its current velocity vector. The remaining triangles represent other agents. The 
highlighted velocity vector indicates the resulting vector adapted to satisfy the require- 
ments of that rule. The white circles visualize the effective range of local perception of 
the active agent. The black circle represents the center of all agents located in the local 
perception. 

The cohesion rule aims to move toward the swarm’s center. Thus, an agent tries to 
point its velocity vector to the center of its local perception. The alignment rule ensures 
that all agents within the range of local perception aim for the same direction of motion. 
Finally, the separation rule states that a minimum distance to all neighbors [153] must 
be maintained. 
Fig. 3.45: Schematic representation of the three basic rules for simulated swarm behavior [153].
Mobile obstacles are people moving with headgear, and static obstacles could be the charging
station, walls, or shelves.


Weighting the individual rules influences the appearance of the swarm behavior. For
example, if collisions must not occur under any circumstances, the separation rule
receives a high weight. After appropriate weights have been assigned, the calculated
velocity vectors are summed up to a control vector, which determines the final movement
of an agent, as can be seen in Figure 3.45. The three basic rules above can also be
supplemented by other rules, as seen in [483]. One of the advantages of swarm control
is that there is no need for centralized control of each drone. Depending on the
implementation, each drone in the swarm can perceive its local environment by itself
and avoid collisions. Thus, it is possible to maneuver an arbitrary number of drones
simultaneously without centrally calculating a separate trajectory for each drone. This
decentralized control makes the swarm implementation scalable.
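
The weighted combination of the three basic rules can be illustrated with a short
two-dimensional Python sketch. It is not the implementation running on the drones: the
function name, the dictionary layout of an agent, and the specific form of each rule
vector (offset to the local center, offset to the mean local velocity, and unit vectors
pointing away from neighbors that are too close) are assumptions made for illustration.

```python
import math


def swarm_control_vector(agent, neighbors, radius, min_dist, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of cohesion, alignment and separation for one agent.

    agent, neighbors: dicts with "pos" and "vel" as (x, y) tuples.
    radius: range of local perception; min_dist: required minimum distance.
    weights: (cohesion, alignment, separation) weights.
    """
    ax, ay = agent["pos"]
    local = [n for n in neighbors if math.dist(agent["pos"], n["pos"]) <= radius]
    if not local:
        return (0.0, 0.0)  # nothing perceived, so no rule contributes
    k = len(local)
    # Cohesion: point toward the center of the locally perceived agents.
    coh = (sum(n["pos"][0] for n in local) / k - ax,
           sum(n["pos"][1] for n in local) / k - ay)
    # Alignment: adapt to the mean velocity of the local agents.
    ali = (sum(n["vel"][0] for n in local) / k - agent["vel"][0],
           sum(n["vel"][1] for n in local) / k - agent["vel"][1])
    # Separation: move away from every neighbor closer than min_dist.
    sep_x = sep_y = 0.0
    for n in local:
        d = math.dist(agent["pos"], n["pos"])
        if 0.0 < d < min_dist:
            sep_x += (ax - n["pos"][0]) / d
            sep_y += (ay - n["pos"][1]) / d
    w_c, w_a, w_s = weights
    return (w_c * coh[0] + w_a * ali[0] + w_s * sep_x,
            w_c * coh[1] + w_a * ali[1] + w_s * sep_y)
```

Because each agent only needs the positions and velocities inside its own perception
range, the same function can run independently on every drone. Raising the separation
weight relative to the other two makes collision avoidance dominate the resulting
control vector.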
3.7.5 System Architecture of Drones


The drones used in this work are a custom design using the Crazyflie 2.0 as a base
platform [509]. Figure 3.46 shows the UAV used in this work. A single drone has a
size of 103 × 103 × 29 mm and a total weight of 37.85 g. It can carry a payload of
64.85 g (compared with 15 g for an original Crazyflie drone). Thus, the drones can use
larger batteries, which increases the flight time. Furthermore, additional sensors such
as cameras and Lidar can also be mounted on the drone.
Fig. 3.46: Crazyflie 2.0 drones used in this work.


The propulsion system comprises four brushless DC motors, on which four rotors with a
diameter of 60 mm are mounted in reverse. Thus, the flow characteristics of the
propellers are not negatively influenced by the frame. In addition, the frame structure
itself is lightweight and built from fibre composite material.

The computing hardware consists of two microcontrollers. An STM32F405RG from
STMicroelectronics [648] is used as the main computing unit. It performs all
computationally intensive calculations and control tasks of the drone. For this
purpose, the microcontroller uses a 32-bit ARM Cortex-M4 with a clock frequency of
168 MHz and a floating-point arithmetic unit. The SRAM has a capacity of 192 kB, and
the flash memory holds 1 MB of program code. An EEPROM of 8 kB is connected to the
microcontroller via the I²C bus to store static information.

A second microcontroller (nRF51) from Nordic Semiconductor is used for wireless
communication and as a power manager. This microcontroller uses a 32-bit ARM
Cortex-M0, which clocks at 32 MHz and has a 32 kB SRAM and a 128 kB flash memory. It
has a low idle power consumption of 3 µW [591]. The two microcontrollers communicate
via a UART interface.

Fig. 3.47: System architecture for a drone swarm as a platform for data acquisition. The main
components with their interfaces and the interaction between all components are shown [509].


The functionality of the drone swarm's system architecture is distributed across
several external systems, as shown in Figure 3.47. Because these systems are only loosely
coupled, each of them can be replaced easily.

The Robot Operating System (ROS) server is well suited for inter-process commu- 
nication [457]. Therefore, it is a part of the high-level application layer for controlling 
Crazyflie 2.0. Data from the MoCap system determines the absolute position and the 
attitude of the drones. A unique marker configuration consisting of four markers is
attached to every drone to identify each drone uniquely. The captured point cloud from
the MoCap system generates set points for the drones and compensates for the drift of
the inertial measurement unit onboard the drone. All parameters converge in a ROS 
server and are forwarded to the drone. 

The main computation task performed onboard the drones is the calculation of the 
flight parameters. A trajectory is computed from the set point and the state estimates in 
a specified time. The set-point values are obtained through an Extended Kalman Filter 
(EKF) to achieve a higher overall control accuracy. Trajectories can also be calculated 
externally and transmitted directly to the drone. 

Communication between the external systems and the drone happens via the 
2.4 GHz ISM band. The CRTP (Crazyflie Real-Time Protocol) is designed to send data 
packets without much overhead, thus minimizing the latency. The transmission speed 
is configured to 250 kbit/s to prevent interference caused by the metal walls of our 
research center and enable the transmitted signals to be decoded reliably by the drones. 
The MQTT server transmits drone information from the ROS server to other subsystems. 

3.7.5.1 Recharge Station 

Another adaptation allows the drones to be charged in a drone recharge station. For this
purpose, two holders, by which the drones hang in the station for charging, are attached to
the drone frame. The charging current for the drone then flows via these holders. Thus,
a battery exchange is no longer necessary, enabling continuous operation.

The drone recharge station represents another external system. It is constructed 
out of stainless steel rods and currently allows charging up to 32 drones distributed 
over three floors. A 200 W 5 V/20 A power supply feeds power to the recharge station. 
The recharge station can be hung from the roof or fixed in another safe place. When the
drones land on the station, charging starts automatically. This makes
24/7 operation of the drone swarm achievable. If a drone is low on battery, it
automatically flies to the charging station, and another fully charged drone takes
over its tasks.


3.7.6 Testbed 


The testbed was developed at the Chair of Material Handling and Warehousing research 
center at TU Dortmund University. It is housed in a lightweight hall that is structurally 
identical to conventional industrial buildings in the logistics sector and follows the 
concept of a highly flexible development laboratory [478, 520]. Its purpose is
to provide an environment for simulating real warehouse scenarios. The testbed is
equipped with an infrastructure designed to prototype Cyber-Physical Systems (CPS) 
accurately. While the test area remains free of permanently installed infrastructure, the 
hall contains several permanently installed observation systems on the ceiling, walls, 
and floor. 


Fig. 3.48: General view of the testbed created in the research hall. (a) A person (wearing headgear)
standing amid the drone swarm during a test scenario. (b) Charging station hanging from the roof.


3.7.6.1 Architecture 

Figure 3.49 shows the architecture used to create the testbed. The architecture is similar 
to the one used in [465]. The MoCap system consists of 46 infrared cameras manufactured
by Vicon. It can track a large number of appropriately marked objects
with an accuracy of 0.3 mm and operates at a data transmission rate of up to 200 Hz
with latencies of 4 ms to 15 ms. The observed experimental space is 22 m long, 15 m
wide, and between 3.5 m and 4 m high. The localization data is accessible to multiple clients
simultaneously over the network and provides the absolute position and attitude of 
the marked objects in a three-dimensional space. Figure 3.48 shows a general view of 
the testbed. 

Fig. 3.49: Architecture used for creating the testbed [465]. 


The testbed uses MQTT (Message Queuing Telemetry Transport) [292] to distribute 
data between the subsystems in Figure 3.47. The simulation subsystem, designed as a 
development platform, consists of a programmable 3D modelling environment. In this 
work, Unity is used for 3D modelling. Unity generates virtual objects and scenes. All 
objects in a scene are mapped into an inheritance tree, which can be manipulated or 
extended using the C# programming language [307]. In addition to the common objects
of the 3D modelling environment, such as cameras or light sources, C# scripts can create
custom objects.

The simulation subsystem comprises a laser projection system consisting of eight 
Kvant Clubmax FB4 laser projectors [373]. The laser system generates both static and 
dynamic projections of virtual objects from the simulation. Therefore, it is possible 
to visualize complex algorithms [436]. Visualization can be done for demonstration 
purposes and a better understanding of the complex behavior of an algorithm. 

All marked objects in the physical space are mirrored into the Unity 3D simulation
environment via the MoCap system. In addition to the representation of physical objects, the
simulation can contain any number of virtual objects. Virtual objects with no physical 
representation are mirrored on the hall floor via the laser projection system. Unity sends 
simple motion setpoints to the UAVs and receives the resulting positions of the UAVs 
via the MoCap connection. The low-level control logic is performed onboard the UAVs. 
In simulation-only mode, UAVs and their trajectories can be simulated faster than real
time.

In this work, TensorFlow is used to develop machine learning algorithms. These 
algorithms are coupled to the simulation environment via the ML-Agents toolkit [307]. 
The toolkit implements reinforcement learning on a drone and uses Unity as a training 
environment. During training, the simulation is executed up to a hundred times faster. 
The behavior learned by the neural network then controls the drone via Unity. The current
system uses C# in the Unity simulation, Python and C++ for drone control based on the 
ROS, and plain C on the embedded systems of the drones. 


3.7.6.2 Drone Swarm Setup 

The testbed described in the previous section simulates swarm control on a physi- 
cal drone swarm. The current swarm at the research hall consists of up to 16 drones 
controlled by an extended version of the open-source project [509]. Based on the
architecture in Figure 3.47, a transport scenario in a warehouse is replicated.

The drone swarm flies in a test area indicated by the laser projection system. Humans
can enter the test area if they wear laser protection glasses and headbands with
markers, so that the swarm recognizes them as mobile obstacles. Humans can safely
move within the swarm as long as they move at moderate speeds. After takeoff, the
drones fly at heights between 1.5 m and 2.6 m.

A transport order is created when a marked Frisbee disc is thrown onto
the ground in the test area. The order is represented as a virtual packet projected
by the laser system. The target for all orders is an area on the ground indicated by the
laser. The drone with the shortest path to the packet flies to it at a low altitude and picks 
it up. The drone then delivers it to the target area at a low altitude and ascends back to 
join the swarm. Multiple orders can be picked up by multiple drones simultaneously. 
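
The assignment rule described above (the drone with the shortest path to a packet picks it up) can be sketched as follows. This is only an illustration, not the testbed's implementation: the `Drone` record, the use of straight-line distance as a stand-in for path length, and the greedy treatment of simultaneous orders are assumptions made for the example.

```python
import math
from dataclasses import dataclass


@dataclass
class Drone:
    """Hypothetical drone record; field names are illustrative only."""
    ident: int
    position: tuple  # (x, y, z) in metres
    busy: bool = False


def assign_orders(drones, packets):
    """Greedily assign each packet to the closest idle drone.

    Straight-line distance stands in for path length (an assumption).
    Returns a dict mapping packet index -> drone ident.
    """
    assignment = {}
    for idx, packet in enumerate(packets):
        idle = [d for d in drones if not d.busy]
        if not idle:
            break  # no drone left; remaining orders have to wait
        best = min(idle, key=lambda d: math.dist(d.position, packet))
        best.busy = True
        assignment[idx] = best.ident
    return assignment


swarm = [Drone(0, (0.0, 0.0, 2.0)), Drone(1, (5.0, 5.0, 2.0)), Drone(2, (9.0, 1.0, 2.0))]
orders = [(4.0, 4.0, 0.0), (8.0, 0.0, 0.0)]
result = assign_orders(swarm, orders)
```

Marking a drone as busy once it has an order is what allows several orders to be served by different drones at the same time.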


3.7.7 Reinforcement Learning for Micro-UAV Swarm 


After successful simulation of the drone swarm in the testbed, we extended the capa- 
bilities of the swarm by using machine learning to perform a warehouse task. This 
work uses Reinforcement Learning (RL) algorithms. In RL, the algorithm is not given 
examples of optimal outputs but instead discovers them by trial and error [70]. 

The goal in RL is to find suitable actions for a given situation to maximize the reward. 
In RL, an agent is an entity that acts based on a policy. The policy defines how an agent 
behaves based on the observations it perceives at a given time. An agent is embedded 
within an environment, and at a given instance, it is in a specific state. The value of 
a given state refers to how rewarding it is to be in that state. From the current state, 
the agent can take one of a set of actions, each of which can bring it to a new state, provide a
reward, or both. The agent’s main objective is to maximize the total cumulative reward 
that it receives over the long run [664]. 

Reinforcement learning is used extensively with drones for various applications,
such as tracking and following the leader drone in a swarm [14], achieving
decentralized control of a drone swarm [51], collision avoidance [518], and trajectory
planning [318]. This work shows that machine learning techniques such as RL can be 
tested in the testbed environment created in the research hall. 


3.7.7.1 Scenario 

In this work, a single drone uses RL to perform the task of transporting an object to
the target area, as a proof of concept. A drone flying in the testbed is presented
with a virtual object. At first, the drone wanders around in the testbed, unsure what to 
do. Eventually, it picks up the object and delivers it to the target area, getting a reward. 
After multiple training sessions, the drone learns that picking up and delivering an 
object is the best way to maximize the reward. 


3.7.7.2 Implementation 

The ML-Agents toolkit creates simulated environments using the Unity Editor and
interacts with them via a Python API [307]. The toolkit provides the ML-Agents SDK,
which contains the necessary functionality to define an environment within the Unity 
Editor and the core C# scripts to build a learning pipeline. In this work, the task of 
transporting orders, as described in Section 3.7.6.2, is simulated. The environment is 
the testbed in the research hall, and the agent is the drone. Initially, the drone needs
to learn how fast to spin its motors in order to move a specific distance in
a particular direction. Then, the drone computes the relative distance between itself
and the target to decide what action to take next. The action of picking up the order
and delivering it to the target area earns the drone a positive reward. The drone receives
a fixed penalty (negative reward) for every other action. The goal is to maximize the
rewards and minimize the penalties. Thus, the drone learns to pick up and deliver
an object swiftly to avoid a heavy time penalty.
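
To illustrate why a fixed per-action penalty pushes the agent toward fast deliveries, the reward scheme can be sketched as below. The numeric values `DELIVERY_REWARD` and `STEP_PENALTY` are assumptions chosen for the example; the text only states that the delivery reward is positive and the penalty is fixed.

```python
DELIVERY_REWARD = 1.0   # assumed value; the text only says "positive reward"
STEP_PENALTY = -0.01    # assumed value; the text only says "fixed penalty"


def step_reward(delivered: bool) -> float:
    """Reward for one action: positive on delivery, fixed penalty otherwise."""
    return DELIVERY_REWARD if delivered else STEP_PENALTY


def episode_return(steps_until_delivery: int) -> float:
    """Total reward of an episode whose last action is the delivery."""
    rewards = [step_reward(False)] * (steps_until_delivery - 1) + [step_reward(True)]
    return sum(rewards)


fast = episode_return(20)   # delivery after 20 actions
slow = episode_return(80)   # delivery after 80 actions
```

Under these assumed values the faster episode collects the larger total reward, which is the incentive described in the paragraph above.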

A hierarchical approach integrates the swarm algorithms with RL algorithms. The
swarm algorithms, such as separation, cohesion, and obstacle avoidance, are given a higher
weight, which ensures collision-free flight and the safety of the persons in the testing
area.
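
One simple way to realize such a weighting is a weighted sum of steering vectors, sketched below. The function name `blend`, the weight values, and the representation of each rule as a 3D vector are assumptions for illustration; the text does not specify how the weights are applied.

```python
def blend(swarm_vectors, rl_vector, swarm_weight=0.8, rl_weight=0.2):
    """Weighted sum of swarm steering vectors and the RL action.

    swarm_vectors: list of (x, y, z) tuples, e.g. separation, cohesion,
    obstacle avoidance. The weights are assumed values chosen so that
    the swarm rules dominate the RL suggestion.
    """
    combined = [rl_weight * c for c in rl_vector]
    for vec in swarm_vectors:
        for axis, component in enumerate(vec):
            combined[axis] += swarm_weight * component
    return tuple(combined)


# The RL policy wants to fly in +x, but separation pushes away from an obstacle in -x.
command = blend(swarm_vectors=[(-1.0, 0.0, 0.0)], rl_vector=(1.0, 0.0, 0.0))
```

With the swarm rules weighted more heavily, the resulting command points away from the obstacle even though the learned policy suggests the opposite direction.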

An RL technique called Proximal Policy Optimization (PPO) is used in this work. PPO is 
a family of policy optimization methods that use multiple epochs of stochastic gradient 
ascent to perform each policy update [587]. PPO uses a neural network to approximate 
the ideal function that maps an agent’s observations to the best action an agent can 
take in a given state. A simple neural network (developed in TensorFlow) with an 
input layer, one hidden layer and an output layer is used. The network’s output is a 
vector indicating the direction the drone should fly in. Agents can ask for decisions 
from the policy either at a fixed or dynamic interval, as defined by the developer. The 
PPO algorithm is implemented in TensorFlow and runs in a separate Python process. 
Communication between Python and Unity takes place via the gRPC protocol [718]
and utilizes protobuf messages.
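
For orientation, the clipped surrogate objective that PPO is commonly described with can be written in a few lines. This is a plain-Python sketch of that general formula, not the toolkit's code; the clipping parameter `epsilon=0.2` is an assumed default, and the text does not state which settings were used.

```python
def ppo_clip_objective(ratios, advantages, epsilon=0.2):
    """Mean clipped surrogate objective over a batch.

    ratios:     pi_new(a|s) / pi_old(a|s) for each sample
    advantages: advantage estimate for each sample
    """
    total = 0.0
    for ratio, adv in zip(ratios, advantages):
        clipped = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
        # Taking the minimum removes the incentive to move the policy
        # further than the clipping range allows in one update.
        total += min(ratio * adv, clipped * adv)
    return total / len(ratios)


gain_capped = ppo_clip_objective([1.5], [1.0])   # ratio above 1 + epsilon
loss_kept = ppo_clip_objective([0.5], [-1.0])    # ratio below 1 - epsilon
```

The objective is maximized by gradient ascent over several epochs on the same batch, which matches the description of PPO given above.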

After training the RL agent for an hour, the trained model is saved. The saved model 
computes the actions of the drone in the testing phase. While testing, it is observed 
that the Micro-UAV can perform the transport task successfully using RL. Moreover, a 
hierarchical approach allows the swarm algorithms to be used with the RL algorithm. 
The simulations were performed on a single drone, but can be extended to the entire 
swarm. 


3.7.8 Conclusion 


This contribution describes a testbed for the development of resource-constrained 
Micro-UAV swarms. The testbed architecture combines external observation, visualiza- 
tion, and simulation systems with a drone control unit. The architecture in Figure 3.47 
successfully integrates external systems with drones and ensures fast data exchange us- 
ing radio signals. Swarm algorithms formulate a collision-free path for each drone. The 
Micro-UAVs successfully perform a transportation task using reinforcement learning 
in combination with swarm algorithms. Therefore, it is possible to integrate machine 
learning into the current setup, which opens up new opportunities in the use of ma- 
chine learning with resource-constrained drones. This work describes how drones can 
be integrated into a process environment and operated in a meaningful way. In addition, 
an automatic charging station extends the runtime of the swarm. 


3.7.9 Future Work 


Industrial use of micro-UAVs will increase as sensors become smaller and more efficient
for use on drones. Thus, on-board sensors will be used instead of a MoCap system for
accurate localization in the future. Future work will include autonomous exploration
of an environment, which will involve the swarm creating a map from a safe starting 
point and updating it in subsequent flights. The recharge station described in this paper 
will form the basis for this work. Moreover, human-machine and machine-machine 
interaction will be developed in the hall to create a safe environment for humans to 
work alongside drones. 

Various ML algorithms could be implemented directly on the drones to enable 
them to perform complex tasks. For example, ML could be used to learn the complex 
drone trajectories and motion patterns and thus could help to reduce the dependence 
on some of the systems in Figure 3.49. In the future, the different research halls could 
be equipped with a 5G installation that allows the drone swarm to fly between halls. 
Connecting the swarm via 5G to an already existing high-performance cluster for ML will 
enable larger-scale field studies for further development of the underlying algorithms. 


4 Smart City and Traffic 


4.1 Inner-City Traffic Flow Prediction with Sparse Sensors 


Thomas Liebig 


Abstract: The current traffic situation in urban areas and its forecasting are of interest 
to various application scenarios, as cities become more crowded and jammed. But the 
observation and monitoring of traffic situations are expensive, and thus estimates need 
to be imputed for unobserved locations and predicted for future locations. 


In this contribution, we focus on a situation-aware routing use case, which prevents
traffic jams. Such a system needs to take into account real-time estimates of unobserved
and future traffic. We present several probabilistic approaches to estimating traffic
quantities.


4.1.1 Introduction: Problem Understanding 


Traffic congestion is a crucial problem of urban traffic, both for logistics and for
passenger traffic. Data-driven, dynamic control and a mobility shift to automated vehicles
could potentially ease current problems and lead to a mobility change in urban envi- 
ronments. 

However, traffic systems are complex real-time systems with multiple actors, so 
control is difficult. Moreover, the observation of this process by measurements is sparse
and valid only locally in space and time. Several real-time imputation and prediction
steps are therefore required before the computation of meaningful dynamic
recommendations for control is possible.

However, if individual navigation could take the predictions of future traffic situa- 
tions into account, one would be able to avoid congested road segments or to decide 
on the best mode of travel in advance. Moreover, since some hazards such as traffic
jams are often caused by excessively high traffic densities, situation-aware trip planning
would also cause less traffic congestion, and the infrastructure could be used more
efficiently. 

The tasks posed by situation-dependent routing are 

— the prediction of future traffic situations from sparse observations, 
- the utilization of dynamic predictions in planning, and 

— the evaluation and selection of individual actions. 


Open Access. © 2023 the author(s), published by De Gruyter. This work is licensed under the
Creative Commons Attribution 4.0 International License.


In the following sections, we address these tasks one after the other and reflect on the
contributions we made to these questions within the CRC 876. The contribution concludes
with a discussion that highlights future research directions. 


4.1.2 Traffic Prediction from Sparse Observations 


In data mining, learning a prediction model is a well-defined supervised learning task. 
Given labeled observations $(x, y)$ from a space $X \times Y$, we train a model $f : X \to Y$ such
that the expected loss $l$ between the prediction $f(x)$ and the truly observed label $y$ becomes
minimal. With this model, the prediction in a certain situation can be obtained by
applying the model to new data, $\hat{y} = f(x)$.
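
Since the expected loss cannot be computed without knowing the data-generating distribution, training usually minimizes its empirical counterpart over the available sample. As a sketch, with $N$ training pairs $(x_i, y_i)$ and the notation $\hat{f}$ for the fitted model (both introduced here for illustration):

```latex
\hat{f} \;=\; \arg\min_{f} \; \frac{1}{N} \sum_{i=1}^{N} l\bigl(f(x_i),\, y_i\bigr)
```

This makes explicit why the approach depends so strongly on the available data: the sum only covers situations that were actually observed.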

However, the modeling process is highly dependent on the available data, and the 
suitability of this approach depends on the entanglement of the modeled process. 

In traffic control, we have a dynamic spatial process, and observed data is valid only 
for a limited extent in space and time. Moreover, the basic assumption of the supervised 
learning approach that the process repeats and, therefore, that past observations are 
suitable to project the future, does not necessarily hold in general. Traffic control and
individual decisions that depend on the expectation of future traffic behavior are
counterexamples to this assumption.

Based on these difficulties, various approaches to model traffic exist for different 
model assumptions, modes of transportation, and granularities. 


4.1.2.1 Gas Kinetic Models 

In contrast to the data-driven supervised learning approach, one could start by observing
patterns and physical properties of traffic and representing these in models.
This model-driven approach is subject to the physics of transport and traffic theory.
One of the basic observable properties of traffic is that individual moving objects (cars,
pedestrians, etc.) do not disappear; rather, over time the number of objects entering
a spatial region equals the number of objects leaving this region.¹ By formulating
this observation as a conservation law and deriving a general description model, we
obtain so-called gas-kinetic traffic models. Systems of differential equations model
traffic similarly to a liquid or gas. Prominent examples of these models are Burgers
turbulence and the Navier-Stokes equations. In macroscopic traffic modeling (focusing
on traffic at coarse granularity), gas-kinetic models are often applied. A possible numeric
approximation is force-based models, which update the individual states of particles based
on impacting forces and inertia. The criticisms of these approaches to traffic modeling are
manifold [271]:


1 Note that this property requires a sufficiently long observation interval. For example, in a residential
building people tend to rest and stay overnight, while in a car park vehicles are stored until departure.




— If vehicles interact, momentum and kinetic energy are usually not conserved.
Thus, Newton's third law of motion (actio = reactio) is not applicable.

— The temperature of a vehicle fluid cannot be matched directly, as it is the variance
of the vehicle speed.

— Vehicular gases do not move on account of external pressure; their motion is caused by
the inner intention to move at a certain speed.

— Due to the various movement targets, separate flows in different directions occur 
and interact. 

— Vehicular behavior is anisotropic. 
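
For reference, the conservation law mentioned at the start of this section is usually written as a continuity equation. The following is the standard one-dimensional form for a single road; the symbols $\rho$ (vehicle density), $v$ (mean speed), and $q$ (flow) are introduced here for illustration and do not appear in the text above:

```latex
\frac{\partial \rho(x, t)}{\partial t} + \frac{\partial q(x, t)}{\partial x} = 0,
\qquad q(x, t) = \rho(x, t)\, v(x, t)
```

It states that the density in a road section changes only through the difference between inflow and outflow, which is the "objects do not disappear" property expressed for infinitesimally small sections.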


4.1.2.2 Cellular Automaton 

Cellular automata are a widely used model of physical processes. Introduced by von
Neumann [473], a cellular automaton features discrete space and time, with transition
rules defining the future state of a cell based on its previous state and the state of its
neighbors. The advantage of cellular automata over continuous dynamics is their scalability.
In addition, boundary conditions are often implemented in a cellular automaton model
because they have a natural interpretation at this level of description (e.g., particles
bouncing back from an obstacle). For traffic modeling, the Nagel-Schreckenberg model
is widely used.
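
A minimal sketch of the Nagel-Schreckenberg update on a circular single-lane road is given below. The encoding of the road as a Python list, the function name `nasch_step`, and the parameter values `v_max=5` and `p_brake=0.3` are choices made for this example.

```python
import random


def nasch_step(road, v_max=5, p_brake=0.3, rng=random):
    """One parallel update of the Nagel-Schreckenberg model on a ring road.

    road: list in which None marks an empty cell and an int is a car's speed.
    """
    n = len(road)
    new_road = [None] * n
    for i, v in enumerate(road):
        if v is None:
            continue
        # Number of empty cells up to the next car ahead.
        gap = 0
        while gap < n - 1 and road[(i + gap + 1) % n] is None:
            gap += 1
        v = min(v + 1, v_max)              # 1. accelerate
        v = min(v, gap)                    # 2. brake to avoid a collision
        if v > 0 and rng.random() < p_brake:
            v -= 1                         # 3. random slowdown
        new_road[(i + v) % n] = v          # 4. move
    return new_road


rng = random.Random(0)
road = [0, None, None, 0, None, None, None, None, None, None]
history = [road]
for _ in range(50):
    history.append(nasch_step(history[-1], rng=rng))
```

The new state of every cell depends only on the current state of the road, which is the Markov property discussed in the next paragraph, and the number of cars stays constant from step to step.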

A cellular automaton models a Markov chain, and the conditional probabilities of 
the future state are completely described by the current state. This Markov assumption 
does not generally hold for traffic at all granularities. Consider the movement on a 
highway or the queue at a traffic light. Without additional individual information on 
the past, one could not tell whether a vehicle leaves the highway or continues, whether 
the queue will start moving, and in which direction the cars turn at traffic lights. 


4.1.2.3 Hierarchy of Motion and Dependency Models 

Previous models describe traffic as a Markov process. Current traffic observation and a 
fixed number of past observations suffice to predict the future. This approach neglects 
the sociological and psychological aspects of traffic. People travel with a purpose
and follow a certain plan to achieve it. Hoogendoorn thus defines the hierarchy of
motion [280], which represents the different aspects of trip planning. 

Observed traffic combines individual traffic plans and thus has a complex dependency
structure which is hard to capture. Tobler's first law of geography [689] states that
in a spatio-temporal process, geographically close observations are more related than
distant observations. However, in traffic processes, this does not hold, as individual
movement paths often start in a residential area, use a larger street, and branch off to a
small street where the workplace is situated. So, the information about driving on a
highway might be less informative for predicting the goal of a trip than the few
starting locations of the trajectory.

The dependency structure of the traffic process can be represented in a model as follows.
Given any evidence of the presence of a moving object at several locations $x_i, \dots, x_j$,
we may model the likelihood of being at other places, $p(x_k, \dots, x_l \mid x_i, \dots, x_j)$. If these
dependencies are formulated without any circular dependencies, the joint probability
distribution over the presence of a moving object at all locations factorizes as:

$$P(x_1, \dots, x_n) = \prod_{i=1,\dots,n} p(x_i \mid pa(x_i))$$

where $pa(x_i)$ are the parents of $x_i$. This directed model is called a
Bayesian network and can be represented by a graphical model (a graph consisting of
vertices and connecting directed edges) encoding dependencies as arrows between the
random variables (the vertices), $pa(x_i) \to x_i$. A Bayesian network consists of a structure,
given by the previous equation, and the associated conditional probability tables for
each variable given its parents.
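
The factorization can be made concrete with a toy network. The three locations, the chain structure, and all probabilities below are made up for illustration and are not taken from the cited work.

```python
from itertools import product

# Hypothetical chain of locations: home -> main_road -> office.
# Each variable is 1 if the moving object is observed at that location.
parents = {"home": (), "main_road": ("home",), "office": ("main_road",)}

# Conditional probability tables: P(var = 1 | parent values). Numbers are invented.
cpt = {
    "home": {(): 0.3},
    "main_road": {(0,): 0.1, (1,): 0.8},
    "office": {(0,): 0.05, (1,): 0.6},
}


def joint(assignment):
    """P(x_1, ..., x_n) as the product of p(x_i | pa(x_i))."""
    prob = 1.0
    for var, pa in parents.items():
        p_one = cpt[var][tuple(assignment[q] for q in pa)]
        prob *= p_one if assignment[var] == 1 else 1.0 - p_one
    return prob


all_present = joint({"home": 1, "main_road": 1, "office": 1})
total = sum(joint(dict(zip(parents, values))) for values in product((0, 1), repeat=3))
```

Each table only needs one entry per combination of parent values, which is what keeps the representation compact compared with a full joint table over all locations.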

The dependency model can be trained directly from data by checking whether the
trained dependency model represents the same distribution as the traffic observations,
using a suitable loss function to compare distributions, for example the Kullback-Leibler
divergence. But due to the vast number of random variables (one per addressable
location, e.g., a street segment), learning the model requires some relaxation
and approximation. In [406, 407], we propose an algorithm to learn spatial Bayesian
networks from traffic observations.


4.1.2.4 Gaussian Processes 

By the central limit theorem, we know that observations converge towards a normal 
distribution when repeated multiple times or given a sufficient observation time. In the 
case of multidimensional observations (e.g., traffic counts at various locations in a street 
network) this converges to a stationary multivariate normal distribution. Assuming 
these traffic observations are generated by a probabilistic process, we can use the 
knowledge of the joint probability distribution to impute observations for unobserved 
locations (similar to the previous section). Under the assumptions described above, we 
may assume the observations were generated by a multivariate Gaussian distribution. 


$$P(f \mid X) = \mathcal{N}(0, K)$$


The generating process is completely defined by the covariance between the variables. 
Due to the finite number of traffic observations (e.g., measuring locations at street 
segments), we can denote their pairwise covariances in a kernel matrix K.² When we
observe some of these locations, we may impute the value at the other locations using 
the kernel matrix. 


2 For an infinite number of observations, a kernel function would have to be applied.

The correlations in traffic values imposed by the traffic network can be represented 
in the kernel K. Assuming that a person randomly moves on the traffic network, she 
travels in the network from one location to another. The dependency structure generated 
by these random walks can be captured by the diffusion kernel [336], where L is the 
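combinatorial Laplacian of the adjacency matrix, and $\lambda$ is a hyperparameter:

$$K = \exp(\lambda L)$$

When the locations are split into the unobserved ones (index $u$) and the remaining ones
(index $-u$), with observation noise of variance $\sigma^2$ on the latter, the kernel matrix
has the block structure

$$\begin{pmatrix} K_{-u,-u} + \sigma^2 I & K_{-u,u} \\ K_{u,-u} & K_{u,u} \end{pmatrix}$$

The diffusion kernel models every route choice as equally likely. But knowledge of real
trajectories can be used to weight the adjacency matrix, and the correlation model can be
detailed further [409].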
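
A small pure-Python sketch of a diffusion kernel on a toy street graph is shown below. It uses a truncated Taylor series for the matrix exponential, which is adequate only for small graphs and moderate $\lambda$. The sign convention is an assumption: the code exponentiates $\lambda (A - D)$, i.e. the negative of the Laplacian $D - A$, so that a positive $\lambda$ corresponds to diffusion; the text does not state which convention it uses.

```python
def mat_mul(a, b):
    """Product of two square matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def diffusion_kernel(adjacency, lam, terms=30):
    """exp(lam * H) by truncated Taylor series, with H = A - D (assumed sign)."""
    n = len(adjacency)
    h = [[adjacency[i][j] - (sum(adjacency[i]) if i == j else 0) for j in range(n)]
         for i in range(n)]
    kernel = [[float(i == j) for j in range(n)] for i in range(n)]  # identity
    term = [row[:] for row in kernel]
    for k in range(1, terms):
        # term becomes (lam * H)^k / k!
        term = [[lam * x / k for x in row] for row in mat_mul(term, h)]
        kernel = [[kernel[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return kernel


# Three street segments in a row: 0 - 1 - 2.
path_graph = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
K = diffusion_kernel(path_graph, lam=0.5)
```

In this toy example the covariance between adjacent segments is larger than between segments two steps apart, reflecting how a random walk spreads over the network.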

A handy property of predictions made by a Gaussian process is that these predictions 
are normal distributions and are not just expectations. Thus, we may quantify the un- 
certainty of the predictions and use them to find near-optimal sensor placement [408]. 
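
For reference, these predictive distributions follow from the standard conditioning identities for multivariate Gaussians. In the sketch below, $u$ indexes the unobserved locations, $-u$ the observed ones, $y_{-u}$ denotes the noisy measurements, and $\sigma^2$ the noise variance; this notation is assumed here for illustration.

```latex
\mu_{u} = K_{u,-u}\,\bigl(K_{-u,-u} + \sigma^2 I\bigr)^{-1} y_{-u},
\qquad
\Sigma_{u} = K_{u,u} - K_{u,-u}\,\bigl(K_{-u,-u} + \sigma^2 I\bigr)^{-1} K_{-u,u}
```

The predictive covariance $\Sigma_{u}$ does not depend on the measured values themselves, only on which locations are observed, so candidate sensor placements can be compared before any data is collected.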

However, the calculation of predictions requires an expensive matrix inversion 
step. With some preprocessing, the data can be grouped into nearly independent local 
chunks and the calculations can be distributed [111]. 


4.1.2.5 Markov Random Fields 
In the previously described time-dependent models (fluid dynamics and cellular automata),
the assumption is that future states depend only on the current state. This property
is called the Markov property. If we now consider a field of random variables, and 
the Markov property holds for these variables, it is called a Markov Random Field 
(MRF). A Markov random field is similar to a Bayesian network in its representation 
of dependencies. The difference is that Bayesian networks are directed and acyclic, 
whereas Markov networks are undirected and may be cyclic. Thus, a Markov network 
can represent certain dependencies that a Bayesian network cannot (such as cyclic 
dependencies). By contrast, it may not be able to represent certain dependencies that a 
Bayesian network can (e.g., induced dependencies). The graph of a Markov random 
field may be finite or infinite. 

Any positive Markov random field can be written as an exponential family, such
that the full joint distribution can be written as


$$P(X = x) = C \cdot \exp\Bigl(\sum_k w_k f_k(x)\Bigr)$$

A multivariate normal distribution forms a Markov random field with respect to a graph 
G = (V, E) if a correlation of zero corresponds to missing edges in the graph. 

A tailored method to model traffic with Markov random fields is the spatio-temporal 
random field [499], which penalizes complex structures in the learning process by 
regularization and thus holds a sparse representation of the underlying distribution.
The Markov property holds over time: the next observation is completely defined by 
current sensor readings. An application of this model to traffic is presented in [475]. 


4.1.2.6 Poisson Dependency Models 

Empirical observations of spatial phenomena are often count values. For instance, 
density is the number of objects in a spatial area and flow is the number of objects 
passing a location in a given time interval. While previous probabilistic models primar- 
ily model categorical data or multivariate normal distributed observations, Poisson 
models seem to be a natural fit, as count values are neither binary nor continuous but 
are discrete with a right-skewed distribution over an infinite range [249]. A possible 
approach to combining graphical modeling with Poisson distributions is to use an
ensemble of Poisson regression trees [249], each modeling a conditional Poisson
distribution. With this model, the underlying joint distribution is unknown and local
distributions might be inconsistent. Thus, pseudo-Gibbs sampling [265] is required to
impute unobserved measurements from given evidence. This algorithm initializes the
unobserved variables at random and then updates these values according to the
conditional distribution of each variable given its parent variables. After a burn-in phase, which is
highly dependent on the initial distribution, this algorithm draws samples from the 
joint distribution. 
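
The sampling loop can be sketched as follows. The local models here are simple hand-written rate functions standing in for the Poisson regression trees, and the variable names (`in`, `mid`, `out`), coefficients, and sample counts are invented for the example.

```python
import math
import random


def sample_poisson(rate, rng):
    """Draw from a Poisson distribution (Knuth's method; fine for small rates)."""
    threshold = math.exp(-rate)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


# Hypothetical local models: the conditional Poisson rate of each unobserved
# road segment given the current values of its parent variables.
RATES = {
    "mid": lambda s: 1.0 + 0.5 * s["in"] + 0.2 * s["out"],
    "out": lambda s: 0.5 + 0.6 * s["mid"],
}


def pseudo_gibbs(evidence, n_samples=200, burn_in=100, seed=0):
    """Impute unobserved counts by repeatedly resampling each local conditional."""
    rng = random.Random(seed)
    state = dict(evidence)
    for var in RATES:
        state[var] = rng.randint(0, 10)      # arbitrary random initialization
    samples = []
    for sweep in range(burn_in + n_samples):
        for var, rate_fn in RATES.items():
            state[var] = sample_poisson(rate_fn(state), rng)
        if sweep >= burn_in:                 # discard the burn-in phase
            samples.append(dict(state))
    return samples


samples = pseudo_gibbs({"in": 4})
mean_mid = sum(s["mid"] for s in samples) / len(samples)
```

The observed variable stays fixed at its evidence value throughout, while the averages over the retained samples serve as imputed values for the unobserved segments.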

For the imputation of unobserved traffic values, this model outperforms exponential
models in [247]. On a massive dataset, training these dependency models is
challenging. In [451], the authors show how dependency networks can be trained on
coresets, compressed datasets that can be used as a proxy for the original data. For
their algorithm, there is a proven guarantee that in the case of Gaussian dependency
networks, the size of the coreset is independent of the size of the dataset. This property
does not hold in general; in particular, it does not hold for Poisson dependency networks.


4.1.2.7 Conditional Sum-Product Networks 

The need for tractable inference in graphical Poisson models led to the adaptation of
sum-product networks [506] to Poisson distributions [452]. The graphical structure of these
sum-product networks consists of a tree having alternating sum and product nodes in 
the layers and Poisson-distributed random variables in the leaves. For inference, the 
structure just needs to be traversed once from the bottom to the root. For training the 
model, independences between sets of random variables are estimated. If no 
independence is found, similar objects are grouped into clusters, and a sum node is introduced. 
If independences are found, a product node is constructed. 

For modeling traffic, a model that combines the physical properties (as represented 
by a cellular automaton) with probabilistic data-driven models would be beneficial. 
In [597, 598], we show how to model Markov processes with sum-product networks 
by conditioning them on the previous time slice. The model can represent a cellular 
automaton, is data-driven, and outperforms previous Poisson dependency networks 
for traffic prediction. 
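
The single bottom-up inference pass over a tree of sum nodes, product nodes, and Poisson leaves can be illustrated with a small sketch. The class names, the tiny example network, and its rates and weights are assumptions made for this illustration; they are not taken from [452] or [597].

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass
class PoissonLeaf:
    var: str     # name of the count variable modeled by this leaf
    rate: float  # Poisson rate of the leaf distribution


@dataclass
class Product:
    children: List["Node"]  # children over disjoint (independent) variable sets


@dataclass
class Sum:
    weights: List[float]    # mixture weights, assumed to sum to one
    children: List["Node"]  # one child per cluster of similar observations


Node = Union[PoissonLeaf, Product, Sum]


def poisson_pmf(k: int, rate: float) -> float:
    return math.exp(-rate) * rate ** k / math.factorial(k)


def evaluate(node: Node, evidence: Dict[str, int]) -> float:
    """Probability of the evidence; leaves of unobserved variables contribute 1."""
    if isinstance(node, PoissonLeaf):
        if node.var not in evidence:
            return 1.0  # marginalize the unobserved variable
        return poisson_pmf(evidence[node.var], node.rate)
    if isinstance(node, Product):
        result = 1.0
        for child in node.children:
            result *= evaluate(child, evidence)
        return result
    return sum(w * evaluate(c, evidence) for w, c in zip(node.weights, node.children))


# Two clusters, each factorizing over the counts of two sensors "a" and "b".
spn = Sum(
    weights=[0.3, 0.7],
    children=[
        Product([PoissonLeaf("a", 1.0), PoissonLeaf("b", 5.0)]),
        Product([PoissonLeaf("a", 4.0), PoissonLeaf("b", 2.0)]),
    ],
)
joint = evaluate(spn, {"a": 2, "b": 3})
marginal_a = evaluate(spn, {"a": 2})
```

Because every node of the tree is visited once, the joint probability and any marginal cost the same single traversal, which is what makes inference tractable in this model class.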


4.1.2.8 Differentially Private Learning from Label Proportions 

Traffic data is usually collected in a centralized manner, which results in high data 
transfer and data protection risks. It is especially important that data protection risks 
are addressed by institutions, due to the introduction of GDPR in all EU countries in 
2018 [231]. Organizations want to use this data in order to gain more information or 
predict future sensor states, e.g., “Will the traffic flow stay the same over the next 15-30 
minutes?” Accordingly, they also have to be compliant with GDPR. 

Therefore, we extend the decentralized learning approach from [651, 655] by applying 
differential privacy to the label proportions sent between the decentralized 
sensor devices, resulting in a privacy-preserving algorithm. In general, the Learning 
from Label Proportions (LLP) algorithm stays the same as proposed in [651]. Because 
the structure and flow of the algorithm matter for what follows, we briefly summarize 
it. There are $m$ wireless sensor nodes $(n_1, n_2, \ldots, n_m)$, which store their measurements 
in $D(i)\ \forall i \in 1 \ldots m$. Each row in $D(i)$ consists of the measurements in $[t - w, t]$, 
where $t$ denotes a timestamp and $w$ is the window size of the last $w$ measurements. Each 
row is assigned a label, which is taken from a measured value at a future timestamp 
$t + r$. First, those measurements are split into batches $B_1, \ldots, B_h$, where 
$h = [|D(i)|/b]$ and $b$ denotes the size of the batches into which $D(i)$ is divided. The 
batches are then used to calculate label proportions for each batch. The generated label 
proportions are sent to the closest $c$ neighbors. Each node uses the received label 
proportions to train $c + 1$ models $f_k^{(j)}$, where $k \in 1, \ldots, c + 1$ and $j$ is the current node. The 
prediction is made by a majority vote of the $c + 1$ trained models. This approach 
has the advantage that we can make use of more than only the locally measured data points 
while keeping the bandwidth of transferred data low because only aggregated data 
is sent between the nodes. However, privacy cannot be guaranteed by this approach. 
Assume we have traffic flow measurement values with labels 0, 1, 2, 3, 4, and over a 
time frame of size $b$ only label 4 is present. Then, from the label proportion, it can be 
inferred that everyone drove that fast during the period. 
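
The batching step and the privacy leak in the example above can be made concrete with a few lines of Python. The label values are made up, and letting a trailing batch be shorter than $b$ is an assumption of this sketch.

```python
from collections import Counter
from typing import Dict, List, Sequence


def split_into_batches(labels: Sequence[int], batch_size: int) -> List[Sequence[int]]:
    """Split the label column of D(i) into consecutive batches of size b."""
    return [labels[k:k + batch_size] for k in range(0, len(labels), batch_size)]


def label_proportions(batch: Sequence[int], label_set: Sequence[int]) -> Dict[int, float]:
    """Fraction of rows in the batch that carry each label."""
    counts = Counter(batch)
    return {y: counts[y] / len(batch) for y in label_set}


label_set = [0, 1, 2, 3, 4]
labels = [4, 4, 4, 4, 4, 0, 1, 1, 3, 4]  # made-up flow labels of one node
batches = split_into_batches(labels, batch_size=5)
proportions = [label_proportions(batch, label_set) for batch in batches]
# proportions[0] puts all mass on label 4, which reveals every label in that batch.
```

Only the proportion vectors leave the node, which keeps the transferred data small, but the first batch shows why aggregation alone does not protect individual measurements.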

We solve this issue by applying differential privacy to the label proportions. First, 
we have to calculate the $\ell_1$-sensitivity, which states how much influence a single 
data point can have on the output of a function $f : D \to R$: 

\[
\Delta f = \max_{\substack{D,\, D' \\ \|D - D'\|_1 = 1}} \|f(D) - f(D')\|_1 \tag{4.1}
\]

For this scenario, $D$ is the current batch $B_i$, and $R$ is the resulting label count. Considering 
that we have a simple counting query, a single data point can have an influence of at 
most 1 (see [185], Example 3.1). Finally, we can use the Laplace distribution to 
generate noise, which can be added to the label counts to be privacy-compliant under 
$\varepsilon$-differential privacy [185]: 


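\[
\mathrm{lap}(x \mid \mu, s) = \frac{1}{2s} \exp\!\left(-\frac{|x - \mu|}{s}\right) \tag{4.2}
\]
\[
\mathrm{lap}\!\left(x \,\Big|\, 0, \frac{\Delta f}{\varepsilon}\right) = \frac{\varepsilon}{2 \Delta f} \exp\!\left(-\frac{\varepsilon\, |x|}{\Delta f}\right) \tag{4.3}
\]


In the formulas above, the position parameter $\mu$ is set to 0, and the scale parameter $s$ is 
set to $\Delta f / \varepsilon$. These parameters have to be set like this to be compliant with the differential 
privacy definition (a proof can be found in [185], Theorem 3.6). 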

The modified algorithm for calculating label counts is shown below. As mentioned 
before, the batches $B_i$ are already generated, and the possible labels $Y$ are also 
known. The output $Q(j)$ contains differentially private label proportions for all batches. 


Algorithm 3: 
Input: B_1, ..., B_h, Y 
Output: Q(j) 
 1  Q(j) ← matrix(h, |Y|); 
 2  for i in 1..h do 
 3      for j in 1..|Y| do 
 4          Q(j)_{i,j} ← sum(B_i == Y_j); 
 5      end 
 6      // adding noise to label counts 
 7      m ← sum(Q(j)_i); 
 8      for j in 1..|Y| do 
 9          Q(j)_{i,j} ← Q(j)_{i,j} + lap(μ = 0, s = 1/ε); 
10          clip Q(j)_{i,j} to bounds [0.001, m]; 
11          normalize Q(j)_i; 
12      end 
13  end 


Initially, $Q(j)$ is created with dimensions $h$ (the number of batches) by $|Y|$ (the number 
of possible labels). Afterward, the label proportions are calculated iteratively for each batch 
as follows. First, the label counts (see lines 3-5) and the total sum (see line 7) are calculated. 
Then the Laplace noise, whose scale is determined by the sensitivity and $\varepsilon$, is applied. 
Afterward, the new value is clipped to the given bounds to prevent values that are too large or 
negative. Finally, the label counts with noise are normalized. The resulting proportion 
is stored in $Q(j)$. 

The proposed approach builds on the existing LLP algorithm to retain its decentralized 
properties and extends it by applying differential privacy to the transferred 
data. This yields reduced data transfer and increased privacy [568]. 
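
A minimal Python sketch of the noisy label proportion computation is given here for illustration. It is not the authors' implementation: the function names are hypothetical, the Laplace noise is drawn as the difference of two exponential variates, and the normalization is applied once per batch after all counts have been perturbed and clipped, which is one possible reading of the algorithm.

```python
import random
from typing import List, Sequence


def sample_laplace(scale: float, rng: random.Random) -> float:
    """Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_label_proportions(batches: Sequence[Sequence[int]],
                              labels: Sequence[int],
                              epsilon: float,
                              rng: random.Random,
                              sensitivity: float = 1.0) -> List[List[float]]:
    """Return one noisy label proportion vector per (non-empty) batch."""
    q = []
    for batch in batches:
        # Label counts of the current batch and their total.
        counts = [sum(1 for value in batch if value == y) for y in labels]
        m = sum(counts)
        noisy = []
        for count in counts:
            value = count + sample_laplace(sensitivity / epsilon, rng)
            noisy.append(min(max(value, 0.001), m))  # clip to [0.001, m]
        total = sum(noisy)
        q.append([value / total for value in noisy])  # normalize to proportions
    return q


batches = [[4, 4, 4, 4, 4], [0, 1, 1, 3, 4]]
q = private_label_proportions(batches, labels=[0, 1, 2, 3, 4],
                              epsilon=0.5, rng=random.Random(42))
```

A smaller `epsilon` increases the noise scale `sensitivity / epsilon` and therefore the privacy level, at the price of less informative proportions.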


4.1.3 Efficient Routing with Dynamic Predictions 


In a changing world, geo-spatial data is subject to dynamic changes, and geo- 
information systems are required to incorporate real-time updates in their analysis and 
computations. In this section, we focus particularly on route-planning systems. While 
in a static world, many algorithms exist to compute the (shortest) path from a starting 
location to a target location efficiently (see Section 4.1.2), this problem becomes more 
difficult in the case of multi-modal trip planning, as with public transport, because temporal 
constraints, e.g., transit times and departure times, need to be incorporated. In 
the real world, these static schedules are not always met: delays occur [439], and deviations 
from the schedule can be observed. Incorporating this dynamic information into 
route computation is beneficial, as it provides tractable travel recommendations to 
the public. The dynamic information on the delays can be obtained by monitoring the 
positions of the vehicles and by predicting future delays. This enables proactive trip 
computation. 

In this section, we focus on the tractability of dynamic transit computation. Existing 
single-source shortest path computation algorithms for the dynamic transit problem 
suffer from long computation times. Transfer Patterns, a very fast route planning 
algorithm for transit networks, does not guarantee soundness in case of real-time delay 
information. Our approach [411] overcomes these shortcomings and introduces dynamic 
transfer patterns, a data structure that encodes which novel transit possibilities are 
enabled due to the delays. 

In comparison with existing dynamic transit-routing schemes in the city of Warsaw, 
we highlight the performance gain using our method. Our findings are implemented in 
the commonly used open-source trip planning framework OpenTripPlanner. 

Here, we focus on the point-to-point shortest path problem [49], where in a graph 
$G = (V, E)$ a path between a source $s \in V$ and a target $t \in V$ needs to be found such 
that the cumulative edgewise cost $l(u, v)$, with $(u, v) \in E \subseteq V \times V$, along the path is 
minimized. 

The standard solution to the problem is Dijkstra’s algorithm [176]. Given the graph 
$G = (V, E)$ and $s, t \in V$, it initializes a queue of nodes $Q = V$ and a distance function 
over $V \times V$ with $\mathrm{dist}(s, s) = 0$ and $\mathrm{dist}(s, v) = \infty,\ \forall v \neq s, v \in V$. Until the queue is 
empty, the node $u$ with the smallest distance $\mathrm{dist}(s, u)$ is picked and removed from 
$Q$. For each neighboring node $v$ of $u$, the distance is updated as follows: $\mathrm{dist}(s, v) := 
\mathrm{dist}(s, u) + l(u, v)$, if the latter is smaller than the former. Dijkstra’s algorithm can be 
sped up by running it simultaneously from both $s$ and $t$ until a common node $u$ is hit. 
In A* [258], a slightly modified version of Dijkstra’s algorithm, the order in the priority 
queue for the traversal not only depends on the cumulated costs to reach a vertex in 
the graph but also on the expected costs to reach the goal from this vertex. Bounded by 
Minkowski’s inequality, $\|x + y\|_p \le \|x\|_p + \|y\|_p$ (known as the triangle inequality 
for $p = 2$), A* prunes the search space in comparison with Dijkstra’s algorithm. A sound 
heuristic for the remaining cost estimation is the geographical distance, which never 
exceeds the road-based distance. 
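
The relation between the two algorithms can be sketched in a few lines: with a zero heuristic the routine below behaves like Dijkstra’s algorithm, and with the straight-line distance it becomes A*. The toy graph, its coordinates, and all identifiers are assumptions chosen for illustration.

```python
import heapq
import math
from typing import Callable, Dict, List, Tuple

Graph = Dict[str, Dict[str, float]]


def a_star(graph: Graph, source: str, target: str,
           heuristic: Callable[[str], float] = lambda v: 0.0) -> Tuple[float, List[str]]:
    """Shortest path search; with the default zero heuristic this is Dijkstra."""
    dist = {source: 0.0}
    parent: Dict[str, str] = {}
    queue = [(heuristic(source), 0.0, source)]
    while queue:
        _, g, u = heapq.heappop(queue)
        if g > dist[u]:
            continue  # stale queue entry, a cheaper one was processed already
        if u == target:
            path = [u]
            while path[-1] in parent:
                path.append(parent[path[-1]])
            return g, path[::-1]
        for v, cost in graph.get(u, {}).items():
            candidate = g + cost
            if candidate < dist.get(v, math.inf):
                dist[v] = candidate
                parent[v] = u
                # Priority = cost so far + estimated remaining cost.
                heapq.heappush(queue, (candidate + heuristic(v), candidate, v))
    return math.inf, []


# Toy road network: every edge cost is at least the straight-line distance.
coords = {"s": (0.0, 0.0), "a": (2.0, 0.0), "b": (2.0, 2.0), "t": (4.0, 0.0)}
graph: Graph = {
    "s": {"a": 2.0, "b": 3.0},
    "a": {"t": 2.5},
    "b": {"t": 3.0},
    "t": {},
}
dijkstra_result = a_star(graph, "s", "t")
a_star_result = a_star(graph, "s", "t",
                       heuristic=lambda v: math.dist(coords[v], coords["t"]))
```

Both calls return the same route; in this example the A* variant reaches the target without ever expanding node `b`, because the heuristic pushes it behind the target in the queue.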

In the case of static cost functions, contraction hierarchies [223] are a data structure 
that speeds up the A* algorithm and enables trip calculation in large traffic networks. 
Instead of searching the shortest path directly within the traffic network, contraction 
hierarchies reduce the search space to the most important connections. In a prepro- 
cessing step, these important segments are identified (based on the topology), and the 
network is extended by edges between these important links. 

In contrast to regular road networks, public transportation data enhances a spatial 
graph with temporal data by adding timetable information. A trip T serves a sequence 
of stops $\mathrm{stops}(T) = (s_1, \ldots, s_n)$, $s_i \in S$. $T$ connects two stops $s_a$ and $s_b$ if and only if 
$\mathrm{stop}(T, s_a) < \mathrm{stop}(T, s_b)$. If multiple trips contain the exact same sequence of stops, 
they form a line [47]. 
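
As a small illustration of these definitions, the sketch below checks whether a trip connects two stops and groups trips into lines. It assumes that $\mathrm{stop}(T, s)$ denotes the position of $s$ in the stop sequence and that a stop occurs at most once per trip; the trip data are invented.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Stops = Tuple[str, ...]


def connects(stops: Stops, s_a: str, s_b: str) -> bool:
    """True if the trip serves s_a strictly before s_b."""
    return s_a in stops and s_b in stops and stops.index(s_a) < stops.index(s_b)


def group_into_lines(trips: Dict[str, Stops]) -> Dict[Stops, List[str]]:
    """Trips with exactly the same stop sequence form one line."""
    lines = defaultdict(list)
    for trip_id, stops in trips.items():
        lines[stops].append(trip_id)
    return dict(lines)


trips = {
    "T1": ("A", "B", "C"),
    "T2": ("A", "B", "C"),  # same sequence as T1, hence the same line
    "T3": ("A", "C"),       # skips B, hence a different line
}
lines = group_into_lines(trips)
```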

A common approach is to model the dynamics in the graph $G$ and then to apply 
Dijkstra’s algorithm. This results in either a time-extended or a time-dependent model. In the 
time-extended model, every transit node is split into multiple vertices for each event 
(arrival, transit, and departure). The time-dependent model assigns every transit node 
one vertex, and arcs encode temporal constraints. 

Transfer Patterns, a data structure and algorithm introduced by Bast et al. [48], is 
considered state of the art in public transport routing. It is based on the assumption that 
during a day, there are only a few optimal routes from stop $s$ to stop $t$ that differ only 
in the time they take place. In a preprocessing phase, optimal routes are computed 
as a sequence of transfer stations, neglecting the time component and information 
about intermediate stations. For each origin and target destination, a directed acyclic 
graph is saved, containing all routes starting with the destination and containing all 
intermediate stations until the origin is reached. 

In a realistic route planning scenario, various delays occur amongst the public 
transport vehicles. In contrast to vehicular traffic, trams and trains cannot overtake each 
other, and vehicles in transit networks wait for connections (e.g., connecting trains). 
This causes delays to propagate differently than vehicular traffic jams. In addition, 
two modes of transportation may share the same physical resource (e.g., buses or 
trams riding on a vehicular street). Thus, two forms of delays in transit networks are 
distinguished in the literature: 1) a vehicle is late for its own reasons, and 2) other 
vehicles are late because of the former [462]. 

Several models for transit delays are reported in the literature. The work in [175] 
assumes independence. By contrast, [232] allows delays to cumulate. Sophisticated 
models incorporate dependencies among the vehicles into the delay [276]. In [439], the 
delays are analyzed visually. 

In a trip planning application, real-time predictions of delay are the main benefit, as 
future delays may influence the route choice. Thus, we highlight two recent works on 
delay prediction and delay recognition: [217] applies queueing theory and assumes 
delays to aggregate, and [774] detects delays and unexpected vehicle movement in real 
time from the GPS traces. 

In this work, we do not focus on the prediction but assume that we have information 
on delays of vehicles (in the commonly used GTFS real-time data format) either from 
vehicle observations or predictions. 

With such dynamics, the trip computation becomes more difficult. Though a previ- 
ous publication [47] states that transfer patterns are delay robust, this only holds as 
long as no new transfers are enabled by the delay. In the likely case that novel transfers 
are enabled, the existing transfer patterns do not represent this information and cannot 
result in the optimal transit route. 

Transfer Patterns were introduced in [48]. The method comprises a data structure 
and an algorithm for fast transit route computation. In a preprocessing step, all possible 
connections are pre-computed and stored in a compressed format. For each public 
transport line, a table is stored, denoting in the columns the stops along the line. In 
this way, it holds the maximal possible route without changes. 

Our approach to the dynamics of the transit information is to incorporate potential 
delay information already in the pre-computation phase, and add additional transfer 
possibilities to the DAGs created during transfer pattern creation. 

As we aim to apply the transit route computations in an industrial context, we 
extend the capabilities of the existing open-source platform OpenTripPlanner (OTP). 
Our dynamic transfer patterns outperform the algorithms previously available in OTP, 
A* [258] and RAPTOR [165], by an order of magnitude [411]. 


4.1.4 Control and Planning of Individual Actions 


Urban areas are increasingly subject to congestion. Most navigation systems and 
algorithms that avoid congestion consider drivers independently and can, thus, 
cause new congestion at unexpected places. The precomputation of optimal trips 
(Nash equilibrium) could be a solution to the problem, but due to its static nature it is of 
no practical relevance. By contrast, we describe an approach to avoid traffic jams with 
dynamic self-organizing trip planning. 

In [412], we apply reinforcement learning to learn dynamic weights for routing from 
the decisions and feedback logs of the vehicles. In order to compare the routing regime 
against others, the validation uses an open simulation environment (LuST) that allows 
the reproduction of the traffic in Luxembourg with varying penetration rates. All of 
these experiments reveal that the performance of the traffic network is increased, and 
the occurrence of traffic jams is reduced by applying our routing regime. 

Traffic closely resembles a bandit feedback learning environment (see [33] for an 
introduction to bandit learning). Bandit learning is a reinforcement learning task where 
the behavior of some black box (e.g., a bandit) should be learned only from the feedback 
we observe. Several actions can be taken (in the bandit problem, this equals 
drawing an arm). However, only the result of the actions can be observed, and it is 
unknown what would have happened otherwise. Vehicles serve as agents that move 
in a road network. The actions are represented by the roads a vehicle can choose at 
an intersection. Once a road is chosen, a reward will be assigned for that particular 
road depending on its actual state. The reward for all other roads that could have been 
chosen remains unknown. This lack of fully labeled data makes a supervised learning 
approach particularly complex. 

The Policy Optimizer for Exponential Models algorithm (POEM) [666] is able to learn 
solely based on the reward values provided by the environment. Additionally, POEM 
does not perform online learning but rather uses logged data. This abstraction is known 
from bandit problems, which seek to optimize a reward from the sole information 
gained after pulling the arm of the bandit. Learning from logged data is the more robust 
approach, since a learned model can be thoroughly tested before deployment. The system 
also does not evolve over time after deployment; such drift could lead to unpredictable 
behavior, which is particularly undesirable in the context of vehicle routing. 

In [666], POEM assigns a structured output to an arbitrary input based on its probability 
of being correct. Therefore, before applying POEM to congestion avoidance, a 
suitable mapping of the routing problem to a policy $h_0$, along with an input space $\mathcal{X}$ 
and an output space $\mathcal{Y}$, must be modeled. Additionally, a cardinal loss feedback mapping 
$\delta$ is required, which serves as the reward function for all selected input/output 
combinations. 

The input space $\mathcal{X}$ was chosen as $\mathcal{X} := [0, 1]^m$. Here, each $x = (x_1, \ldots, x_m)^T \in \mathcal{X}$ 
represents a feature vector of (normalized) sensor measurements for a road segment. 
For instance, a road’s density, occupancy, mean speed, vehicle count, or waiting time 
can be used. Any value not in $[0, 1]$ was scaled using min-max scaling. 
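
The construction of such feature vectors can be sketched as follows. The three example features and their values are invented, and scaling each feature column over the currently available road segments is an assumption of this sketch.

```python
from typing import List, Sequence


def min_max_scale(column: Sequence[float]) -> List[float]:
    """Scale one feature column to [0, 1]; columns already in range stay unchanged."""
    if all(0.0 <= value <= 1.0 for value in column):
        return list(column)
    low, high = min(column), max(column)
    if high == low:
        return [0.0 for _ in column]  # constant column carries no information
    return [(value - low) / (high - low) for value in column]


def build_feature_vectors(raw_rows: Sequence[Sequence[float]]) -> List[List[float]]:
    """Turn raw per-segment sensor readings into vectors x in [0, 1]^m."""
    columns = [min_max_scale(column) for column in zip(*raw_rows)]
    return [list(row) for row in zip(*columns)]


# One row per road segment: occupancy, mean speed (km/h), vehicle count.
raw_rows = [
    [0.2, 50.0, 12.0],
    [0.5, 30.0, 4.0],
    [0.9, 10.0, 20.0],
]
features = build_feature_vectors(raw_rows)
```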

The output space must be a set of suitable, structured outputs. As POEM should be 
applied to the problem of congestion control, a single label indicating whether a road 
is congested or not already provides adequate results. Thus, let Y := {(0), (1)}, where 
(0) indicates a road is not congested and (1) corresponds to congestion. 

The policy $h_0(Y \mid X)$ is a probability distribution over the output space. In other 
words, it assigns a probability to each output $y$ given any input $x$ based on how likely $y$ is 
to be correct under conditions $x$. Hence, predictions are made by sampling $y \sim h_0(Y \mid x)$. 
The goal of POEM is then to improve this policy. Initially, no such policy exists for the 
constructed input and output spaces. This is a common problem when applying POEM. 
Therefore, a default policy is used (compare [666]). Let $h_0(y \mid x) := 0.5$, meaning both 
labels are assigned a probability of 0.5 for all $x$. 
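
The following sketch illustrates what sampling from such a policy looks like. The default policy follows the definition above; the learned policy is modeled, as an assumption of this example, by a two-label linear exponential model that reduces to a logistic function of a weighted feature sum. The weights are invented and do not stem from [666] or [412].

```python
import math
import random
from typing import Callable, Dict, Sequence

Policy = Callable[[Sequence[float]], Dict[int, float]]


def default_policy(x: Sequence[float]) -> Dict[int, float]:
    """h_0: both labels get probability 0.5, independent of the input."""
    return {0: 0.5, 1: 0.5}


def linear_policy(weights: Sequence[float]) -> Policy:
    """Hypothetical learned policy: P(congested | x) = sigmoid(w . x)."""
    def policy(x: Sequence[float]) -> Dict[int, float]:
        score = sum(w * value for w, value in zip(weights, x))
        p_congested = 1.0 / (1.0 + math.exp(-score))
        return {0: 1.0 - p_congested, 1: p_congested}
    return policy


def sample_label(policy: Policy, x: Sequence[float], rng: random.Random) -> int:
    """Draw y ~ h(Y | x); label 1 means congested, label 0 means not congested."""
    return 1 if rng.random() < policy(x)[1] else 0


rng = random.Random(7)
learned = linear_policy([2.0, -1.0])  # invented weights for two features
probabilities = learned([1.0, 0.0])
label = sample_label(learned, [1.0, 0.0], rng)
```

Note that this linear form assigns probability 0.5 to both labels for an all-zero feature vector, so it falls back to the default behavior when no information is available.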

Lastly, in order to improve an existing policy, POEM requires a cardinal loss feedback 
mapping $\delta : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$. This was achieved by applying one of the following 
two primitive congestion detection methods to the sensor readings. The primitive density 
congestion metric, $\delta_{\mathrm{density}}$, assumed a road to be congested when its density was 
greater than one-seventh of its jam density [124]. The primitive mean speed congestion 
metric, $\delta_{\mathrm{speed}}$, assumed a road to be congested when its mean speed was less than 
ten kilometers per hour of its allowed maximum speed. 

In order not only to detect congestion but also to reduce it, vehicles must receive frequent 
information updates about the current state of the road network. Then, POEM is used 
to predict the next state of the road network. This information will be used by vehicles 
to bypass roads that are deemed congested. Thus, those results must also be applied 
in a routing algorithm, such as Dijkstra’s algorithm or A*. 

Let $G = (V, E, c, q)$ be a graph representing a road network. Here, $c$ and $q$ are the 
default cost and heuristic functions. Additionally, assume all vehicles have knowledge 
about a congestion labeling policy $h \in \mathcal{H}_{\mathrm{lin}} \cup \{h_0\}$ [666]. When using dynamic routing, 
vehicles will receive updates about roads at regular intervals $T \in \mathbb{N}$. The update can 
then be written as $u_T : E \to \mathcal{X}$.³ Then, when a vehicle receives update $u_T$, it is able to 
predict how likely a road is to be congested during interval $T + 1$ using $h$. 

The described model receives sensor information only about whole road segments, 
rather than individual lanes, which might be problematic, as congestion does not 
always arise on every lane equally. That challenging situation is most likely to occur 
at junctions where each lane will allow a vehicle to go in a different direction. We 
address this problem by aggregating sensor data for each connected edge pair (for the 
use of a line graph of G, see [257]). Additionally, the resulting data allows more precise 
congestion detection as individual turning lanes are separated in the model. 

In order to bypass arising congestion, a vehicle must recalculate its route with 
respect to the newly received update ur. This is achieved by increasing the weight of 
an edge that is likely congested: 


\[
p_{(e_1, e_2)} = h\big((0) \mid 0.5\, u_T(e_1) + 0.5\, u_T(e_2)\big) \tag{4.4}
\]
\[
c' : E^2 \to \mathbb{R}, \quad (e_1, e_2) \mapsto \frac{c(e_2)}{p_{(e_1, e_2)}} \tag{4.5}
\]


The denominator shows the previously mentioned aggregation of sensor data. For 
notational simplicity, $c'$ is defined for all elements of $E^2$. However, in practice only a 
subset of $E^2$ is used where $e_1$ is incident or equal to $e_2$. 

The function $c'$ calculates the new weight of an edge $e_2$ depending on the preceding 
edge that was reached. For instance, a vehicle on edge $e_1 = (u, v)$ would calculate the 
weight for edge $e_2 = (v, w)$ using $c'(e_1, e_2)$. A vehicle that starts its route on edge $e_2$ 
would use $c'(e_2, e_2)$. 

Essentially, $c'$ divides the default weight of an edge by its probability of not being 
congested in interval $T + 1$. This means the weight of an edge will remain almost 
unchanged when no congestion is expected. The increase will conversely depend on 
how likely congestion is to arise. 


3 Here, it is assumed that updates are received equally for all edges. 
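
The reweighting can be illustrated with a short sketch that divides the default cost of an edge by the predicted probability of it not being congested. The dictionary-based data layout, the example costs and features, and the small floor that guards against a division by zero are assumptions of this illustration.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Edge = Tuple[str, str]
Policy = Callable[[Sequence[float]], Dict[int, float]]


def aggregate(u_e1: Sequence[float], u_e2: Sequence[float]) -> List[float]:
    """Average the sensor features of two connected edges."""
    return [0.5 * a + 0.5 * b for a, b in zip(u_e1, u_e2)]


def dynamic_weight(c: Dict[Edge, float], update: Dict[Edge, Sequence[float]],
                   policy: Policy, e1: Edge, e2: Edge, floor: float = 1e-6) -> float:
    """New weight of e2 when it is entered from e1."""
    x = aggregate(update[e1], update[e2])
    p_free = max(policy(x)[0], floor)  # probability of label (0), not congested
    return c[e2] / p_free


def default_policy(x: Sequence[float]) -> Dict[int, float]:
    return {0: 0.5, 1: 0.5}


def confident_policy(x: Sequence[float]) -> Dict[int, float]:
    # Hypothetical learned policy that considers the road almost surely free.
    return {0: 0.9, 1: 0.1}


e1, e2 = ("u", "v"), ("v", "w")
c = {e1: 30.0, e2: 45.0}                      # default travel costs
update = {e1: [0.2, 0.4], e2: [0.6, 0.8]}     # latest sensor features per edge
weight_default = dynamic_weight(c, update, default_policy, e1, e2)
weight_confident = dynamic_weight(c, update, confident_policy, e1, e2)
weight_start = dynamic_weight(c, update, default_policy, e2, e2)
```

With the default policy the weight of `e2` doubles, whereas a policy that predicts a free road with probability 0.9 raises it only slightly.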

Finally, it was assumed that sensor data updates are available for every road. In real-world 
road networks, permanently installed sensors are much more scarcely distributed 
throughout the network. This problem can be partly alleviated by directly implementing 
sensors in the vehicle (e.g., using navigation applications provided by smartphones, or 
self-driving cars). However, some roads will still remain uncovered. Here, $u_T$ can map 
to $\{0\}^m$. For the previously defined features in $\mathcal{X}$ (a road’s density, occupancy, mean 
speed, vehicle count, and waiting time), its dimension $m$ would be equal to 5. This 
will cause $h$ to assign a probability of 0.5 to both labels (as defined by $\mathcal{H}_{\mathrm{lin}}$ in [666]). 
Another solution might be to map $u_T$ to the average of all sensor readings in an interval. 
Thus, uncovered roads would reflect the average state of a road network. 


Logging 

For POEM, no interactive control over actions is required, as it was specifically designed 
to learn using logged data. Hence, with respect to the previously defined setting, POEM 
requires a dataset: 


\[
\mathcal{D} := \{(x_i, y_i, \delta_i, p_i) \mid i \in \mathbb{N}_{\le n}\}, \quad p_i = h(y_i \mid x_i). \tag{4.6}
\]


This dataset will be created during the logging phase. All edges are assigned weights 
using $c'$, and routes are calculated using an implementation of A*, which produces 
the shortest routes for any admissible heuristic. Additionally, POEM is initially applied 
using the default policy $h_0$, which will scale all weights equally by a factor of two. The 
scaling will not affect A*, meaning no route changes will occur, which in turn simplifies 
learning on previously collected data. 
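
Why the default policy leaves the routes unchanged can be checked directly. The short derivation below is a sketch that only uses the definition of $c'$ and $h_0 \equiv 0.5$; $P$ denotes an arbitrary route written as a sequence of consecutive edge pairs, a notation introduced here for illustration.

```latex
\begin{align*}
c'(e_1, e_2) &= \frac{c(e_2)}{h_0((0) \mid \cdot)} = \frac{c(e_2)}{0.5} = 2\, c(e_2), \\
\sum_{(e_1, e_2) \in P} c'(e_1, e_2) &= 2 \sum_{(e_1, e_2) \in P} c(e_2).
\end{align*}
```

Every route cost is doubled, so the ordering of routes by cost, and hence the cheapest route, stays the same. A heuristic $q$ that never overestimates the remaining cost under $c$ also never overestimates it under the doubled costs, so it remains admissible.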

The data itself can either be collected by each vehicle or by a centralized authority 
monitoring each vehicle. For both approaches, a data entry cannot be created before 
any feedback is available. Thus, intermediate results must be cached. 

First, the aggregated feature vector $x_i$ is logged. The respective label $y_i$ with its 
corresponding probability $p_i$ is then determined using: 

\[
y_i = \begin{cases}
(0), & h((0) \mid x_i) > 0.5 \\
(1), & h((1) \mid x_i) > 0.5 \\
\mathrm{random}((0), (1)), & \text{otherwise}
\end{cases} \tag{4.7}
\]


Here, $\mathrm{random}((0), (1))$ means a label is chosen randomly and uniformly distributed. 
Lastly, the feedback is logged using either $\delta_{\mathrm{density}}$ or $\delta_{\mathrm{speed}}$. The respective results will 
inherently depend on the previously chosen label. 
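
The logging step, including the caching of intermediate results until feedback arrives, might look as follows. The `LogBuffer` class, the key scheme, and the feedback value are hypothetical and only illustrate the flow described above.

```python
import random
from typing import Callable, Dict, Hashable, List, Sequence, Tuple

Policy = Callable[[Sequence[float]], Dict[int, float]]
Entry = Tuple[List[float], int, float, float]  # (x_i, y_i, delta_i, p_i)


def choose_label(policy: Policy, x: Sequence[float],
                 rng: random.Random) -> Tuple[int, float]:
    """Pick the label with probability above 0.5, or a uniformly random one on a tie."""
    probabilities = policy(x)
    if probabilities[0] > 0.5:
        label = 0
    elif probabilities[1] > 0.5:
        label = 1
    else:
        label = rng.choice([0, 1])
    return label, probabilities[label]


class LogBuffer:
    """Caches decisions until the corresponding feedback becomes available."""

    def __init__(self) -> None:
        self.pending: Dict[Hashable, Tuple[List[float], int, float]] = {}
        self.dataset: List[Entry] = []

    def record_decision(self, key: Hashable, x: Sequence[float],
                        label: int, probability: float) -> None:
        self.pending[key] = (list(x), label, probability)

    def record_feedback(self, key: Hashable, delta: float) -> None:
        x, label, probability = self.pending.pop(key)
        self.dataset.append((x, label, delta, probability))


def skewed_policy(x: Sequence[float]) -> Dict[int, float]:
    # Hypothetical policy that favors the label "not congested".
    return {0: 0.9, 1: 0.1}


rng = random.Random(1)
buffer = LogBuffer()
x = [0.4, 0.6]
label, probability = choose_label(skewed_policy, x, rng)
buffer.record_decision(("vehicle-1", 0), x, label, probability)
entries_before_feedback = len(buffer.dataset)
buffer.record_feedback(("vehicle-1", 0), delta=1.0)  # made-up feedback value
```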

The deployment of our self-organizing routing algorithm in an urban area could 
be done in two ways. One option is to use the data of an existing stationary traffic 
information system (e.g., a SCATS [324] system) and feed it into a navigation platform 
that can be used by the citizens. The other option is to turn vehicles directly into sensors 
and retrieve segment-wise statistics on travel time, density, and traffic flow directly from 
the navigation app. In the latter case, one might be worried about individual privacy 
because mobility statistics are recorded centrally. However, recent work [410] provides 
an approach to protect individual privacy, known as homomorphic encryption. This 
approach encrypts the data such that computations can still be carried out on the ciphertext, 
but only the result can be decrypted. In the following, we will test these two deployment 
settings using stationary and moving sensors and compare them to Nash equilibrium 
and uninformed routing. 

For the comparability of experiments with different routing algorithms, it is es- 
sential to guarantee the same traffic demand (i.e., origin/destination pairs) over time. 
For repeatability of the same origin/destination setting, we perform analysis with a 
microscopic traffic simulator called SUMO [349]. The simulator models individual vehicles 
on a microscopic level, so it also controls acceleration and deceleration, and 
it is widely used in traffic simulation and applications. It allows us to control traffic 
demand and provides us with complete knowledge of the performance of the street 
network and the routing performance. In contrast to arbitrary toy experiments, we aim 
at modeling sound traffic scenarios. We use an open simulation scenario in the city of 
Luxembourg [143], which enables the reproduction of 24 hours in the city’s mobility. 

The common procedure of SUMO is to generate the route of each vehicle before the 
simulation starts, which is why its live routing capabilities are rather limited. However, 
SUMO provides the Traffic Control Interface (TraCI), a network interface that allows 
full control over the current simulation. We used this to implement a Java application 
(SUMO-CA) that simulates a central authority. In order to calculate vehicle routes, 
SUMO-CA loads a road network and converts it to a directed, weighted multi-graph. 
When running a simulation, SUMO-CA will receive and parse sensor measurements 
in regular intervals. This information is utilized to predict the next state of the road 
network using POEM. Finally, those results are used to update vehicle routes. 

Unless stated otherwise, each experiment starts at 7:45 (simulation time) and 
runs over a period of roughly 35 minutes, or exactly 2048 seconds. The reason why 
this particular window was chosen is that roads generally are more susceptible to 
congestion during rush hour. Additionally, a size of 2048 seconds allows rerouting 
intervals to be easily scaled using a factor of two. Finally, in order to create more realistic 
jams on arterial roads, SUMO was set to scale the original demand by a factor of 1.3. 

Evaluating vehicle detours is problematic. Neither absolute nor relative differences 
adequately represent measured detours. The reason is that long 
routes allow longer absolute detours, whereas short routes allow longer 
relative detours. Hence, a different metric is required. We propose the usage of the 
weighted relative detour as follows. 

Let $y_A, y_B \in \mathbb{R}_{\ge 0}$ be arbitrary measurements of one vehicle when algorithms $A$ and 
$B$ are applied, respectively. The weighted relative detour $\mathrm{diff}_{w}$ then calculates the 
relative difference, while weighting it using the absolute difference: 

\[
\mathrm{diff}_{w}(y_A, y_B) = |y_A - y_B| \cdot \mathrm{rel}(y_A, y_B) \tag{4.8}
\]

where $\mathrm{rel}(y_A, y_B)$ denotes the relative difference between the two measurements. 


Various charts in [412] present the evaluation results. The baseline is an uninformed 
Uniform Cost Search (UCS), where each road is assigned its static default weight, and 
every vehicle chooses its path individually using A*. In this case, congestion is likely to 
appear. In addition, a Nash equilibrium (NASH) is shown as a baseline. 


4.1.5 Discussion 


Throughout this work, we highlighted various models for predicting traffic 
under different model assumptions and properties. The models represent different 
aspects of traffic at various granularities. 

As an example, modeling car-to-car or vehicle-to-infrastructure interactions in 
inner cities requires different spatio-temporal granularity and thus different model 
assumptions from a macroscopic daily average traffic flow prediction. 

In general, traffic is a chaotic system and the commonly applied Markov assumption 
is often violated in practice. Future traffic does not only depend on a fixed number of 
previous observations. Consider, for example, a semaphore in a traffic system (a traffic 
light, a barrier, or a large public parking facility). In these situations, it is easy to see that, 
though following certain patterns, traffic is chaotic. 

In Google Maps, the inherent structure of traffic data is currently modeled by 
Graph Neural Networks [172, 546]. However, p-adic models are also a promising 
technique at fine granularities to represent the chaotic behavior. Since it is important 
for production-ready systems that dynamic predictions are tractable, conditional 
sum-product networks [597] are also an interesting future research direction. 

We also applied the algorithm used for self-organizing control of navigation plans 
to control the charging prices of electric mobility [543]. In this application, we observed 
that the explored states of the system might be bad for the system provider. As an 
example, in our experiments, reduced and even negative energy prices could provide a 
useful incentive to prevent grid burnouts. However, the total financial risk needs to be 
bounded. Such constraints could be incorporated into reinforcement learning using 
stabilities. 


4.2 Privacy-Preserving Detection of Persons and Classification of 
Vehicle Flows 


Marcus Haferkamp 
Benjamin Sliwa 
Christian Wietfeld 


Abstract: In some places, the continuously increasing road traffic will soon exhaust 
the capacity of existing traffic infrastructure unless appropriate measures are taken. 
Especially in urban environments with a high density of residential and commercial 
properties, the infrastructure is highly utilized or overloaded during peak hours. Since 
structural measures are often not possible or only at great expense, a practical solution 
to counter this issue is to optimize the infrastructure utilization and the control of traffic 
flows. For this purpose, the widely installed Internet of Things (IoT)-powered Intelligent 
Traffic Systems (ITS) can be used, which enable automated detection and high-precision 
classification of different road users and thus transform the infrastructure into a data- 
driven Cyber-Physical System (CPS). 


Although various sensor systems have been proposed, they fulfill only subsets of the 
requirements, including accuracy, cost-efficiency, privacy preservation, and robustness. 
One approach that meets those requirements is a novel radio-based sensor system, of 
which we present two variants in this contribution. The system’s fundamental idea 
is to exploit radio-based fingerprints of road users—multi-dimensional and charac- 
teristic attenuation patterns of several radio links—for detection and classification. 
One of the presented system variants additionally evaluates high-precision channel 
information extracted from Wireless LAN (WLAN) Channel State Information (CSI) or 
Ultra-Wideband (UWB) Channel Impulse Response (CIR) data. The proposed solution 
benefits from increased robustness against a wide range of interferences, e. g., poor 
visibility due to bad weather conditions. Moreover, the system exclusively uses em- 
bedded microcontroller units (MCUs) and radio technologies, allowing compact and 
cost-efficient installations in rural and dense downtown areas. 


We have performed comprehensive field measurement campaigns and machine 
learning-enabled analyses that confirm the presented approach’s high suitability for 
different requirements and application scenarios. In this regard, we have evaluated 
multiple applications, including the comparatively simple detection of road users and 
the fine-grained classification of several vehicle classes. For instance, the proposed 
systems achieve accuracies of more than 99 % for binary classification and 93.83 % for 
differentiating seven vehicle types. 


4.2.1 Introduction 


Following the current trend, it is expected that a large part of the existing transport 
systems will reach their capacity limits in the (near) future. Possible reasons include 
the approval of new types of personal transport (e. g., e-scooters) and the shift to al- 
ternative means of transportation. There are essentially two options to counter this 
problem without stricter regulatory measures for road users: structural measures to ex- 
pand current capacities and the efficient utilization of existing infrastructures through 
optimized traffic flow control. However, the former step is often not an option due to 
high financial costs and additional long-term restrictions caused by the construction 
measures. Instead, more efficient traffic flow control is a realistic undertaking thanks to 
sensor information provided by the vehicles themselves and/or to low-cost IoT compo- 
nents, especially in smart cities. Such systems also collect high-precision and vehicle 
type-specific information, paving the way for novel and more advanced optimization 
methods (e. g., type-specific lane assignment or routing and smart parking). For this 
purpose, the systems must always provide up-to-date and area-wide precise traffic infor- 
mation, which is collected, among other ways, by a sensor network installed over a large 
area. Next to high accuracy, these systems must also meet other requirements. They 
should function reliably in challenging weather and traffic conditions while protecting 
road users’ privacy and be energy- and cost-efficient to operate. In some countries, 
compliance with all these points is a prerequisite for being approved for large-scale 
installations in road traffic. For instance, some sensor solutions are unsuitable for 
this use case because of their characteristic weaknesses—e. g., privacy concerns when 
using camera-based sensors. An increasing number of vehicles is also equipped with 
GNSS (Global Navigation Satellite Systems) and mobile network connectivity, which 
provide detailed information about the current position of a vehicle in real time. In 
the work of Niehöfer et al. (for example, [476, 477]), it has been shown within the CRC 
that the accuracy of the vehicle position can be enhanced through in-depth system 
simulation to provide lane-specific positioning information of vehicles. Yet, any system 
that collects such location information about individual vehicle tracks raises privacy 
concerns. 

Therefore, this contribution presents a novel Wireless Sensor Network (WSN) for 
detecting and classifying different types of road users, which identifies those based on 
characteristic inferences of the signal strength of a radio signal (fingerprint). Initially, 
the wrong-way driver warning system [250] has leveraged the idea of inferring the 
travel direction of passing vehicles based on the time sequence of the radio links’ 
attenuation. We have enhanced this approach to determine vehicles of certain types 
utilizing class-specific fingerprints induced by their shapes and materials. We use 
supervised learning techniques to extract such class-specific similarities from the 
channel information for the evaluation. Specifically, in this contribution, we present 
two generations of the novel detection and classification system. The focus here is on 
the first generation, which correlates the Received Signal Strength Indicator (RSSI) of 
many diagonal and transverse radio links. It can reliably infer different vehicle types 
based on this information [252, 621]. 

Furthermore, we present a modular and more compact system design that adds fur- 
ther high-precision channel information using WLAN CSI and UWB CIR radio technolo- 
gies in addition to coarser information [251]. Since our research’s focus has primarily 
been in the context of the initial system design, this section is devoted to the original 
system. Also, it provides a brief outlook of the successor system. 

Figure 4.1 shows the presented IoT-powered sensor system’s intended information 
flow and its use in a smart city context. Here, the communication modules acting 
as sensors record fingerprints of passing vehicles, preprocess this raw data, and 
perform the classification task. One could use such exact traffic information in two different 
application scenarios. In on-site applications (e. g., parking-lot balancing, wrong-way 
driver detection), the acquired data is evaluated immediately on-site and serves as a 
trigger for further actions (e. g., warnings of wrong-way drivers). By contrast, global 
applications aggregate this locally relevant information to enable predictions and 
recommendations for larger areas. Finally, the widely deployed sensor systems can 
dynamically adjust their predictions by periodically verifying the prevailing traffic 
situation. 

Fig. 4.1: Overall system vision: Embedding of the proposed IoT-based sensor system in a smart city 
environment. All sensor deployments are locally exploited for on-site applications and contribute 
their data to the global data-driven ITS applications ©[2020] IEEE. Reprinted, with permission, from 
[634]. 


4.2.2 Related Work 


In this section, we provide an overview of existing systems and technologies for vehicle 
detection and classification. Figure 4.2 shows the abstract process flow and the main 
logical components of such systems, starting with the gathering of vehicle traces up to 
the final classification task. The sensor technology generates accurate traces as contin- 
uous and high-rate data streams, of which only a part is relevant for the subsequent 
process steps. To determine suitable sequences within these traces, a detection stage is 
typically connected afterward, reducing the overall workload. Based on these selected 
sequences, relevant (statistical) features are then extracted, which serve as input for 
supervised classification procedures using well-defined taxonomies. 

Fig. 4.2: Abstract system model of vehicle classification systems, ranging from data acquisition 
using different sensor technologies to the final classification tasks ©[2020] IEEE. Reprinted, with 
permission, from [634]. 


4.2.2.1 Taxonomies for Classification of Vehicles 

The Federal Highway Administration (FHWA) proposes a 13-class scheme for classifying 
vehicles mainly based on the number of axles [327]. However, this approach’s disad- 
vantage is that the number of axles does not indicate the vehicles’ exact dimensions. 
For example, accurate vehicle length information is essential for providing reliable 
parking space balancing or parking guidance systems. 

An alternative taxonomy is the Nordic System for Intelligent Classification of vehi- 
cles (NorSIKT) [698], which is used in Scandinavian countries and, with its hierarchical 
approach, provides different gradations. 

The ISO 3833-1977 standard, the 2007/46/EC Directive of the European Parliament, 
and the European New Car Assessment Programme (Euro NCAP) provide different 
schemes for classifying vehicles. 

Nonetheless, many academic approaches use individual classification schemes 
to evaluate the performance of the proposed systems. Following this example, we 
have developed an adapted scheme with different refinement degrees in this work (cf. 
Section 4.2.4.1) and applied it to the final performance evaluation. 

4.2.2.2 Sensor Technologies Used in Vehicle Detection and Classification Systems 

This section provides an overview of established sensor technologies used in vehicle 
detection and classification systems. Although much of the referenced work provides 
performance evaluations in terms of typical metrics (mainly classification accuracy), 
we want to note that comparing different solutions is difficult. Reasons for this include 
diverging taxonomies, different approaches for data preparation, and a variety of ML 
methods used for analysis in the respective works, e. g., Support Vector Machine 
(SVM) [147], Random Forest (RF) [102], k-Nearest-Neighbor (kNN), or Artificial Neural 
Network (ANN) [236]. The sensor systems used can be broadly classified as intrusive 
or non-intrusive. 

Intrusive systems represent the classic solution approach and are typically em- 
bedded in the road surface gathering technology-specific measured parameters. The 
used sensor technology directly affects the type and scope of measures required for 
installation or maintenance. While a minimally invasive cutting of the road surface is 
sufficient for some systems, more extensive and costly measures are necessary for other 
approaches. Representatives of this system category include Weigh in Motion (WIM) 
systems, inductive loop detectors (ILDs) using one [145] or more inductive loops [365], 
fiber Bragg grating sensors [674], vibration sensors [747], and piezoelectric sensors [519]. 

Non-intrusive systems include acoustic sensor systems, inertial sensors, camera- 
based approaches, and radio-based solutions. Acoustic sensor systems identify road 
users based on the emitted sounds. The fundamental challenge for these systems is the 
extraction of the relevant signal component from the dominant traffic noise. However, 
studies of acoustic sensor systems have shown that their use is of limited value due to 
comparatively low classification accuracies [225]. By adding other sensor technologies, 
the precision of these systems can be significantly increased [157]. 

Different types of inertial sensors, such as accelerometers, gyroscopes, or mag- 
netometers, are often combined on an inertial measurement unit (IMU). For vehicle 
detection and classification, the IMUs are either installed directly on the road’s surface 
or at its side. One approach is detecting the number of axles of a passing vehicle, from 
which the vehicle class is deduced. For example, such systems achieve accuracies of 
98.98 % for detection and 97 % for length-based classification [40]. 

Camera-based systems use pattern recognition and image processing techniques 
and are widely used due to their high precision. Apart from the detection of road users 
and the classification of vehicle types, the available high resolutions also allow a 
reliable recognition of vehicle makes [612], which can be problematic due to regulatory 
requirements to protect the privacy of road users. Most of these photosensitive systems 
use ambient light, so these systems’ performance varies significantly with the day 
or visibility conditions. Using a Convolutional Neural Network (CNN), the approach 
presented in [178] achieves an accuracy of 95.7 % in daylight and 88.8 % in darkness 
for the classification of six types of vehicles. 

In contrast to RSSI, the evaluation of WLAN CSI provides frequency-specific channel 
information. Orthogonal Frequency-Division Multiplexing (OFDM)-based radio tech- 
nologies such as IEEE 802.11 use this information to estimate a channel’s interference— 
e. g., multipath propagation—and reconstruct the original symbols. Depending on the 
number of transmitting and receiving antennas and the channel bandwidth, between 
64 and 512 subcarriers are sent in a data packet’s training fields. The receiving unit can 
infer the radio channel’s interference by comparing amplitude and phase information 
of the expected and the received subcarrier sequence. Apart from reconstructing the 
original symbols, a variety of applications can exploit such detailed information. In 
addition to vehicle classification [732], localization and tracking of people behind walls 
and doors [9], as well as privacy-preserving monitoring by healthcare applications [316] 
are possible. Another technology, UWB, is predestined for the precise measurement of 
a radio channel because of its high robustness against interference due to its support 
for large channel bandwidths and its ability to determine accurate channel impulse 
responses. Although the primary use of UWB is in the area of localization (and recently 
also as an additional security measure for radio keys), it can also be used for activity 
detection [599] and vehicle detection and classification [251]. Radio-based approaches 
assume that different vehicle types, due to specific shapes and installed materials, 
characteristically attenuate a radio signal. These attenuation patterns, symbolically 
referred to as fingerprints, can be used to distinguish between different vehicle classes. Several 
radio technologies such as Bluetooth [58] or IEEE 802.15.4-based variants [250, 252] are 
suitable for radio-based methods, provided that the transceiver modules allow access 
to specific indicators of signal strength. A common approach is to use the RSSI, which 
is a coarse measure for assessing the received signal strength and depends heavily on 
the Signal-to-Noise Ratio (SNR) of the radio signal. Since these systems operate in the 
2.4 GHz radio range, they exhibit high robustness to poor weather conditions due to 
rain and snowfall [150, 522]. 


4.2.3 Radio Fingerprinting-Based Vehicle Detection and Classification 


This section describes the two variants of the proposed radio-based systems for vehicle 
detection and classification, including all essential components. Although both systems 
follow similar approaches with the evaluation of radio fingerprints, there are differences 
concerning the hardware components and the data processing, which we discuss 
in separate sections. First, the original system, which evaluates the signal strength 
information (RSSI) of multiple transverse and diagonal radio links, is discussed in 
detail (cf. Section 4.2.3.1). Subsequently, we highlight the significant differences and 
innovations of the current system approach that leverages high-precision WLAN CSI 
and UWB CIR channel information in Section 4.2.3.2. 

4.2.3.1 RSSI-Based Vehicle Detection and Classification 

The system setup initially used for the detection of wrong-way drivers [250], consisting 
of a total of six radio nodes integrated into delineators (three transmitter and three 
receiver units), is shown in Figure 4.3. The system setup uses a constant longitudinal 
spacing of $\Delta_{\mathrm{lon}} = 5\,\mathrm{m}$. 

Fig. 4.3: Schematic system overview. Each delineator post contains an RF transceiver module. In 
total, the system uses nine different radio links ©[2020] IEEE. Reprinted, with permission, from 
[634]. 


All nodes use low-cost, off-the-shelf MCUs with IEEE 802.15.4 radio modules equipped 
with omnidirectional antennas and operate with a transmit power of 2.5 dBm in the 
2.4 GHz frequency band. 

For continuously measuring the RSSI of all radio links, the corresponding transmitter 
modules periodically transmit pseudo data every 8 ms, which corresponds to the sampling 
frequency of 125 Hz listed in Table 4.1. The system uses a coordinated 
channel access scheme utilizing tokens to avoid interference between the radio links. 
Then the receiving nodes send the signal strength information they measure to the 
master gateway, which aggregates the raw data and synchronizes it for further process- 
ing. Figure 4.4 illustrates an example of time-varying radio fingerprints gathered for all 
radio links for a passing car (top) and a truck (bottom), respectively. 

Fig. 4.4: Example multi-dimensional radio fingerprints for a passenger car and a truck. The colorized 
signals refer to the transverse radio links; the gray signals correspond to diagonal ones ©[2020] 
IEEE. Reprinted, with permission, from [634]. 


Figure 4.5 illustrates the entire data processing pipeline for the RSSI-based classification 
system. First, the RSSI time series of all nine radio links $\Phi_i$ are recorded as vehicles pass 
through. Our approach then forwards the time signals to the data preprocessing block 
consisting of filtering using a moving average filter and subsequent normalization. 
These steps are relevant for minimizing the influence of scattered outliers (e. g., caused 
by multipath effects) and enabling high compatibility with various machine learning methods 
(feature scaling). Another process block realizes the detection of relevant subsets from 
the preprocessed time series. The system uses an automated thresholding approach to 
determine the individual start point $t_{\mathrm{start}}$ and end point $t_{\mathrm{end}}$ for each time series. 
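
The preprocessing and detection steps described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the window size, the min-max normalization, the fixed threshold of 0.5, and all function names are assumptions made for this example, whereas the 8 ms sample period is taken from the text.

```python
from statistics import mean

def moving_average(samples, window=5):
    """Smooth an RSSI series with a centered moving-average filter (window is assumed)."""
    half = window // 2
    smoothed = []
    for k in range(len(samples)):
        lo, hi = max(0, k - half), min(len(samples), k + half + 1)
        smoothed.append(mean(samples[lo:hi]))
    return smoothed

def normalize(samples):
    """Min-max scale a series to [0, 1]; a constant series maps to all zeros."""
    lo, hi = min(samples), max(samples)
    if hi == lo:
        return [0.0 for _ in samples]
    return [(s - lo) / (hi - lo) for s in samples]

def detect_event(samples, threshold=0.5, sample_period_s=0.008):
    """Return (t_start, t_end) in seconds of the first run below the threshold, or None."""
    start = None
    for k, s in enumerate(samples):
        if s < threshold and start is None:
            start = k                      # attenuation begins
        elif s >= threshold and start is not None:
            return start * sample_period_s, k * sample_period_s
    if start is not None:                  # attenuation lasts until the end of the trace
        return start * sample_period_s, len(samples) * sample_period_s
    return None

# Synthetic trace (assumed values): 15 dB attenuation while a vehicle blocks the link.
rssi = [-60.0] * 10 + [-75.0] * 20 + [-60.0] * 10
event = detect_event(normalize(moving_average(rssi)))
```

For the synthetic trace, the detected interval covers the attenuated samples; a real deployment would derive the threshold automatically, as the text states.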

Fig. 4.5: System architecture model and data preprocessing pipeline used for ML-based vehicle 
classification using radio fingerprints ©[2020] IEEE. Reprinted, with permission, from [634]. 


The sequences tailored in this way serve as input for the subsequent process steps. 
The driving speed estimation serves as an additional feature for the classification 
process. With the help of the known longitudinal distance $\Delta_{\mathrm{lon}}$ between the individual 
delineators, the system can estimate the average speed $\tilde{v}$ of the passing vehicles utilizing 
the temporal difference of the attenuation of the transverse links $\Phi_1$, $\Phi_5$, and $\Phi_9$ using 
the following equation: 

$$\tilde{v} = \frac{1}{3} \left( \frac{d(1,5)}{\Delta t(1,5)} + \frac{d(1,9)}{\Delta t(1,9)} + \frac{d(5,9)}{\Delta t(5,9)} \right) \qquad (4.9)$$

where $\Delta t(i,j) = t_{\mathrm{start}}(j) - t_{\mathrm{start}}(i)$ and $d(i,j)$ is the longitudinal distance between the 
transverse radio links $\Phi_i$ and $\Phi_j$. Negative velocities $\tilde{v} < 0$ refer to an opposite direction, 
which indicates a wrong-way driver in the case of one-way streets. Similarly, we 
use Equation 4.10 to determine an approximation of the vehicle length: 


$$\tilde{l} = \frac{\tilde{v}}{3} \left( \tau(1) + \tau(5) + \tau(9) \right) \qquad (4.10)$$

where $\tau(i) = t_{\mathrm{end}}(i) - t_{\mathrm{start}}(i)$ denotes the duration of the attenuation of radio link $\Phi_i$. 
The system also considers 90 different indicators: ten features for each of the nine radio 
links. These represent statistical variables such as mean value, standard deviation, 
minimum, or maximum. In this way, a dimensionality reduction is performed, since instead 
of several hundred signal strength values, the system only needs to process ten features 
per radio link. 
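
The speed, length, and feature computations can be illustrated with a short sketch. It assumes that the transverse links 1, 5, and 9 sit at 0 m, 5 m, and 10 m along the road (one per delineator pair) and that the length estimate is the estimated speed multiplied by the mean attenuation duration. Only the four statistics named in the text are computed, because the remaining six of the ten per-link features are not listed here. All names and timestamps are hypothetical.

```python
from statistics import mean, pstdev

DELTA_LON_M = 5.0  # longitudinal delineator spacing given in the text

# Assumed longitudinal positions of the transverse links 1, 5, and 9.
LINK_POS_M = {1: 0.0, 5: DELTA_LON_M, 9: 2 * DELTA_LON_M}

def estimate_speed(t_start):
    """Average of the three pairwise speed estimates (cf. Equation 4.9), in m/s."""
    speeds = []
    for i, j in [(1, 5), (1, 9), (5, 9)]:
        distance = LINK_POS_M[j] - LINK_POS_M[i]
        delta_t = t_start[j] - t_start[i]   # negative for the opposite direction
        speeds.append(distance / delta_t)   # raises ZeroDivisionError if delta_t == 0
    return mean(speeds)

def estimate_length(speed, t_start, t_end):
    """Speed times mean attenuation duration (assumed form of Equation 4.10), in m."""
    durations = [t_end[i] - t_start[i] for i in (1, 5, 9)]
    return speed * mean(durations)

def link_features(samples):
    """Mean, standard deviation, minimum, and maximum of one link's RSSI sequence."""
    return [mean(samples), pstdev(samples), min(samples), max(samples)]

# Hypothetical event timestamps in seconds for the three transverse links.
t_start = {1: 0.00, 5: 0.25, 9: 0.50}
t_end = {1: 0.20, 5: 0.45, 9: 0.70}

speed = estimate_speed(t_start)                   # 20 m/s for these timestamps
length = estimate_length(speed, t_start, t_end)   # about 4 m

# Reversing the order in which the links are attenuated yields a negative speed.
reverse_speed = estimate_speed({1: 0.50, 5: 0.25, 9: 0.00})
```

Concatenating the per-link feature lists of all nine links with the speed and length estimates would give one fixed-length input vector per vehicle.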


4.2.3.2 Using CSI and CIR Data for Vehicle Detection and Classification 

Like the previously presented system, the current modular system proposal also relies 
on the assumption that it is possible to reliably distinguish different road users by 
analyzing the characteristic interference they induce to a radio channel. Figure 4.6 
illustrates the novel system approach’s structure, which uses the radio technologies 
WiFi 4 (IEEE 802.11n) and Ultra-Wideband (IEEE 802.15.4a). In addition to comparatively 
coarse signal strength information, these technologies also measure a wealth of exact 
channel parameters. 

Fig. 4.6: Schematic system overview of the novel system approach leveraging WLAN CSI and UWB 
CIR data for bicycle detection (left) and motorized vehicle classification (right). 


For measuring WLAN CSI, the system uses MCUs based on Espressif ESP32 with WLAN 
transceivers connected to directional antennas operating with a transmit power of 
20 dBm in the 2.4 GHz frequency band. For a continuous sampling of the radio channel, 
high-rate dummy packets are exchanged between the respective transmitting and 
receiving nodes. Each received packet contains CSI information for channel estimation. 
To reduce protocol overhead and thus increase overall system performance, the system 
uses unidirectional User Datagram Protocol (UDP) data transmissions. Thanks to an 
Application Programming Interface (API), the MCUs allow access to the CSI and thus 
to amplitude and phase information from various subcarriers. In general, the CSI can 
contain fields other than the Legacy Long Training Field (LLTF), such as the High Throughput 
Long Training Field (HT-LTF) or the Space-Time Block Code High Throughput Long Training 
Field (STBC-HT-LTF), depending on the supported transmission modes of all WLAN 
modules involved as well as the channel characteristics. The system currently uses 
WLAN nodes only for high-rate measurement; the data preparation and ML steps have 
so far only been performed on more powerful computers. 
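
How amplitude and phase follow from raw CSI values can be sketched as below. The example assumes that the CSI buffer holds one signed 8-bit imaginary part followed by one signed 8-bit real part per subcarrier; this layout and the function name are assumptions for illustration and should be checked against the API documentation of the MCU in use.

```python
import math

def csi_to_amplitude_phase(raw):
    """Convert interleaved (imag, real) integer pairs to per-subcarrier amplitude and phase."""
    if len(raw) % 2:
        raise ValueError("expected an even number of values")
    amplitudes, phases = [], []
    for k in range(0, len(raw), 2):
        imag, real = raw[k], raw[k + 1]            # assumed byte order: imaginary first
        amplitudes.append(math.hypot(real, imag))  # |H| of the subcarrier
        phases.append(math.atan2(imag, real))      # phase in radians, range (-pi, pi]
    return amplitudes, phases

# Three hypothetical subcarriers: 3+4j, -5+0j, and 0-1j.
amps, phs = csi_to_amplitude_phase([4, 3, 0, -5, -1, 0])
```

The resulting amplitude and phase vectors, one entry per subcarrier and packet, are the kind of frequency-specific channel information that the ML stage can consume.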

UWB can accurately determine a radio channel’s channel impulse responses thanks 
to very short signal pulses, allowing further insights regarding a radio channel, e. g., 
whether a line-of-sight (LOS) path is available or how many significant signal paths exist. 
The presented system setup uses a custom-made Printed Circuit Board (PCB), combining 
a Decawave DWM1000 UWB transceiver module and an ARM Cortex M3 MCU [687]. 
Like the WLAN nodes, the system currently uses the UWB nodes only to measure channel 
impulse responses. This high-resolution channel data is continuously transferred to 
computers for further processing via USB. Figure 4.7 demonstrates example WLAN CSI 
and UWB CIR traces. 
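
The kind of insight a channel impulse response offers, such as the number of significant signal paths, can be illustrated with a simple peak count. This is a sketch under stated assumptions: the relative threshold of 0.3, the local-maximum criterion, and the crude line-of-sight heuristic (first significant path is the strongest) are illustrative choices, not the method used by the presented system.

```python
def significant_paths(cir_magnitude, rel_threshold=0.3):
    """Return indices of local maxima at or above rel_threshold times the global peak."""
    if not cir_magnitude:
        return []
    limit = rel_threshold * max(cir_magnitude)
    peaks = []
    for k, value in enumerate(cir_magnitude):
        left = cir_magnitude[k - 1] if k > 0 else float("-inf")
        right = cir_magnitude[k + 1] if k + 1 < len(cir_magnitude) else float("-inf")
        if value >= limit and value > left and value >= right:
            peaks.append(k)
    return peaks

# Hypothetical CIR magnitudes: a strong first path followed by one weaker reflection.
cir = [0.0, 0.1, 1.0, 0.3, 0.1, 0.6, 0.2, 0.05, 0.25, 0.0]
paths = significant_paths(cir)

# Crude heuristic (assumption): a LOS path is likely if the first significant path is the strongest.
likely_los = bool(paths) and cir[paths[0]] == max(cir)
```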

Fig. 4.7: Example WLAN CSI and UWB CIR traces. Each colorized line indicates a complete measure- 
ment sample including either multiple subcarriers (WLAN CSI) or CIR buffer sample data (UWB CIR). 


4.2.4 Evaluation Methodology 


This section presents the methodology used to evaluate both system variants. In this 
respect, we explain the system setups used for the field measurements, including 
essential parameters, the taxonomies adopted for the classification, and the ML models 
for performance evaluation. 


4.2.4.1 Field Measurements 
For data acquisition, we installed live systems in different environments for both system 
variants. The original RSSI-based classification system was installed and tested at a rest 
area on the A9 Autobahn as part of an official test site of the German Federal Ministry 
of Transport and Digital Infrastructure (shown in Figure 4.8, right). In total, the traces 
of 2605 vehicles were recorded and then manually labeled using camera images. The 
main parameters of the RSSI-based system can be found in Table 4.1. 

The novel system proposal, which also uses high-resolution WLAN CSI and UWB 
CIR channel data, was tested at two locations. Traces of cyclists were recorded at a cycle 
path (Figure 4.8, left), while those of motorized vehicles, especially those similar to 
passenger cars, were collected at a busy single-lane road (Figure 4.8, middle). Thus, the 
latter setting is similar to that used for the evaluation of the RSSI-based predecessor 
system. Table 4.1 lists the essential parameters for the novel system proposal. 

Tab. 4.1: System parameters of the original RSSI-based system approach and the system evolution 
using WLAN CSI and UWB CIR. 

                         Radio Technology 
Parameter                WLAN CSI       UWB               RSSI 
Transmission power       20 dBm         -14.31 dBm        2.5 dBm 
Operating frequency      2.4 GHz        6.5 GHz           2.4 GHz 
Sampling frequency       80 Hz          40 Hz             125 Hz 
Antenna type             directional    omnidirectional   omnidirectional 
Antenna gain             5-7 dBi        -                 - 
Number of radio links    1              1                 9 

Fig. 4.8: Experimental live deployments of the original RSSI-based system approach (right) and the 
novel CSI- and CIR-based system evolution (left, middle) on three different settings for gathering 
real-world vehicle traces. 


We used multiple taxonomies for the ML-based performance evaluation of the presented 
system variants, illustrated by Figure 4.9 for both the original RSSI-based (left) and 
the novel classification system (right). Defining different taxonomies was necessary 
because we tested both systems at diverse locations characterized by divergent traffic 
flows. Specifically, we evaluated the performance of the systems using taxonomies of 
varying complexity, which we briefly explain in the following: 

Fig. 4.9: Overview of the vehicle classes and sample counts for the different taxonomies used for 
evaluating the RSSI-based system approach (left) and the novel system design (right) ©[2020] IEEE. 
Reprinted, with permission, from [634]. 


Binary This category distinguishes between car-like and truck-like or non-car-like sub- 
classes. While we classified car-like and truck-like vehicles for the RSSI-based 
system, we investigated the detection accuracy of the novel system with regard to 
cyclists using a binary classification with traces of cyclists as well as LOS (idle). No 
object was in the system during the LOS measurements, so fingerprints of the LOS 
radio channel were measured. 

Cyclist vs. Motorized Vehicles Because the dataset of traces for different road users 
was not large enough, we performed the detection and classification of three classes: 
car-like, bicycle (non-car-like), and idle. 

Size-based This was a 3-type classification of vehicles by vehicle length (only for the 
RSSI-based system). 

Body style-based Here we use a fine-grained classification of seven vehicle types 
(only for the RSSI-based system). 


In the body style-based taxonomy, the fine-grained classification task results in an 
increased overlap of vehicle classes with similar shapes (e. g., bus and semi-truck), 
increasing the overall classification inaccuracy. Nevertheless, we considered this com- 
plex taxonomy for the performance evaluation to illustrate the RSSI-based system 
approach’s strengths and limitations. 


4.2.4.2 Machine Learning-Based Classification 

We have used several established and state-of-the-art machine learning models to detect 
and classify vehicles, which we compare and explain below. The following models were 
used to evaluate the performance of both system variants: 

Artificial Neural Networks (ANNs) Inspired by the human nervous system, ANNs 
have received keen attention for different scientific applications in the context of 
deep learning. From an implementation perspective, these models are realized by 
multiple matrix multiplications, which directly affect their memory footprint. 

Random Forests (RFs) Typical representatives of ensemble learning methods are RFs, 
which leverage the fact that most instances are assumed to be correct (wisdom of 
the crowd). Random subsets of features and training data are used for training each 
tree incorporated in an RF. Thanks to their binary decision-making, RFs allow for a 
resource-efficient implementation using simple if/else statements. By adjusting 
parameters such as limiting the number of allowed trees or the maximum depth for 
all trees, both processor and memory utilization can be controlled conveniently. 

Support Vector Machines (SVMs) SVMs aim to separate data points in a multi- 
dimensional space through a hyperplane such that for each feature, the members 
of each class are separated as precisely as possible, which is achieved by minimiz- 
ing a specific objective function. 
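
The remark that RFs map to simple if/else statements can be made concrete with a hand-written toy forest. The three trees, the two input features (an estimated length and a minimum normalized RSSI value), and every threshold below are invented for illustration; trained trees would be generated automatically from the data.

```python
from collections import Counter

def tree_a(length_m, min_rssi):
    # Single split on the estimated vehicle length (hypothetical threshold).
    if length_m < 6.0:
        return "car-like"
    else:
        return "truck-like"

def tree_b(length_m, min_rssi):
    # Two-level tree: attenuation depth first, then length (hypothetical thresholds).
    if min_rssi > 0.4:
        return "car-like"
    else:
        if length_m < 5.0:
            return "car-like"
        else:
            return "truck-like"

def tree_c(length_m, min_rssi):
    # Both conditions must hold for a car-like vote (hypothetical thresholds).
    if length_m < 7.0 and min_rssi > 0.2:
        return "car-like"
    else:
        return "truck-like"

def forest_predict(length_m, min_rssi):
    """Majority vote over all trees; an odd tree count avoids ties for two classes."""
    votes = Counter(tree(length_m, min_rssi) for tree in (tree_a, tree_b, tree_c))
    return votes.most_common(1)[0][0]

label = forest_predict(6.5, 0.5)  # tree_a votes truck-like, the other two vote car-like
```

Limiting the number of trees or their depth directly bounds the number of comparisons and the code size, which is why this representation suits embedded MCUs.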


In addition, we have used the following ML models for evaluating the performance of 
the original RSSI-based system approach: 

Deep Boltzmann Trees (DBTs) Belonging to deep learning models, DBTs benefit from 
the fact that users have neither to extract features nor define transformation func- 
tions because they automatically derive differentiable functions from the given 
dataset. As an inherent downside, DBTs also require the user to select proper hy- 
perparameters and a sufficient amount of training data due to the mass of trainable 
weights. 

Proximity Forests (PFs) Like RFs, PFs belong to ensemble learning models, but instead 
of CART trees, they utilize proximity trees. Proximity trees use associated 
data points from the training set and implement, as their name suggests, a proximity- 
based approach where an object follows the branch with the highest similarity 
regarding a parametrized similarity measure. 
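
The branching rule of a proximity tree can be sketched in a few lines. Here, each branch is represented by one exemplar from the training set, and similarity is taken as the negative Euclidean distance; this measure is a stand-in chosen for brevity, and the exemplar values and names are hypothetical.

```python
import math

def euclidean_similarity(a, b):
    """Similarity as negative Euclidean distance (stand-in for a parametrized measure)."""
    return -math.dist(a, b)

def choose_branch(sample, exemplars, similarity=euclidean_similarity):
    """Return the key of the branch whose exemplar is most similar to the sample."""
    return max(exemplars, key=lambda branch: similarity(sample, exemplars[branch]))

# Hypothetical exemplar fingerprints (normalized RSSI values) for two branches.
exemplars = {
    "car-like": [1.0, 0.6, 0.9],
    "truck-like": [1.0, 0.1, 0.2],
}
branch = choose_branch([0.9, 0.2, 0.3], exemplars)
```

Passing a different `similarity` function changes the branching behavior without touching the tree logic, which mirrors the parametrized similarity measure mentioned above.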


4.2.5 Real-World Validation 


This section presents and discusses the results for both proposed vehicle classifi- 
cation system approaches. Because we have developed and tested both systems 
independently—i.e., in different locations with divergent road users—we cover the 
results in separate sections, starting with the original RSSI-based vehicle detection and 
classification system. 

4.2.5.1 Radio-Based Detection and Classification System 

This subsection covers the results gained for the original RSSI-based system approach 
for vehicle detection and classification. At first, we describe how we have evaluated 
its detection performance, i.e., how accurately the system can determine a passing 
road user. To this end, we fed the raw traces of 2605 vehicles into a system-in-the-loop 
evaluation setup, allowing for flexible parameter tuning of the detection algorithm. 
Thanks to its comparatively complex setup with multiple diagonal and transverse radio 
links, the system also facilitates speed estimation and wrong-way driver detection (see 
Figure 4.10). We simulated the latter task by virtually inverting the order of the radio links 
spanned between the different nodes. Accordingly, an estimated negative speed indicates 
a wrong-way driver. The histograms show distinct speed distributions for the original 
and the inverted node sequence, implying a sound detection of the driving direction for 
passing vehicles. Since the number of detected vehicles matches the number of captured 
vehicle traces, the detection accuracy is 100 %. 
Nonetheless, we want to note that further real-world measurements are necessary to 
confirm the results of our virtual detection of wrong-way drivers. 

Fig. 4.10: Histograms of the speed estimations for the real-world data and the virtually inverted node 
sequence. The vehicle count matches the total of captured vehicle traces, and all wrong-way drivers 
are detected ©[2020] IEEE. Reprinted, with permission, from [634]. 


Next, we provide and discuss the results of the vehicle classification. We have 
utilized 10-fold cross-validation with a 9:1 data split in each fold, i. e., 90 % of the data is 
used for training and the remaining 10 % for testing. After ten iterations, the statistical 
deviations of those folds are derived and used for performance evaluation. Figure 4.11 
illustrates the classification accuracies for different machine learning models and 
the considered vehicle taxonomies (cf. Figure 4.9). The 99 % classification accuracy, 
a typical minimum requirement for some applications, is illustrated as a horizontal 
dashed line. 
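
The 10-fold procedure can be sketched with the standard library alone. The splitting scheme, the seed, and the `evaluate` callback are assumptions for illustration; in practice, `evaluate` would train a model on the training indices and return its accuracy on the test indices, whereas the placeholder below only reports the training share.

```python
import random
from statistics import mean, stdev

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs so that every sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[f::k] for f in range(k)]  # k nearly equal-sized folds
    for f in range(k):
        test = folds[f]
        train = [i for g in range(k) if g != f for i in folds[g]]
        yield train, test

def cross_validate(n_samples, evaluate, k=10):
    """Return mean and standard deviation of the per-fold scores."""
    scores = [evaluate(train, test) for train, test in k_fold_indices(n_samples, k)]
    return mean(scores), stdev(scores)

N_TRACES = 2605  # number of recorded vehicle traces stated in the text

# Placeholder callback (assumption): returns the fraction of data used for training.
mean_share, std_share = cross_validate(N_TRACES, lambda train, test: len(train) / N_TRACES)
```

With 2605 traces, each fold tests 260 or 261 vehicles and trains on the rest, which corresponds to the 90 %/10 % split described above.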

Fig. 4.11: Comparison of the overall classification accuracies for the considered machine learning 
models and considered vehicle taxonomies ©[2020] IEEE. Reprinted, with permission, from [634]. 


For the binary taxonomy, all evaluated models exceed the given 99 % threshold in some 
runs but fall below it in others. Only the SVM achieves a mean accuracy that 
matches this threshold. For the more complex vehicle taxonomies, the overall accuracies 
of all models decrease significantly: 93 % to 98 % for the size-based taxonomy 
and 90 % to 95 % concerning the fine-grained task. The apparent deviations between 
the models’ performances result from their different learning strategies. While the 
stochastic nature of RF induces more significant standard errors in cross-validation, the 
DBT obtains lower performance levels than the remaining models because it calculates 
a probability measure for the given data. 

Finally, the class-specific classification accuracy for the three considered vehicle 
taxonomies, i. e., binary, size-based, and fine-grained, is depicted in Figures 4.12, 4.13, 
and 4.14. Starting with the binary taxonomy, which is the simplest classification task 
differentiating car-like and truck-like vehicles, the main challenge for all models is to 
classify truck-like cars correctly. We can validate this assumption by interpreting the 
classification results for mid-sized vehicles, as shown in Figure 4.13: all models have 
similar standard error values, whereas they perform notably better for small- and 
large-sized cars. For the fine-grained taxonomy, the multitude of similarly shaped vehicles 
and the underrepresentation of traces for specific vehicle types (e. g., bus) lead to 
lower classification accuracies due to larger standard deviations. 


4.2.5.2 Vehicle Classification Using WLAN CSI and UWB CIR 

Fig. 4.12: Normalized confusion matrices for binary vehicle classification. 
C: car-like, T: truck-like ©[2020] IEEE. Reprinted, with permission, from [634]. 

Fig. 4.13: Normalized confusion matrices for size-based vehicle classification. S: small, M: medium, 
L: large ©[2020] IEEE. Reprinted, with permission, from [634]. 


For the new modular classification system, we present and discuss the classification 
results in this subsection. As previously mentioned, we have conducted multiple 
measurement campaigns for gathering traces of both cyclists and motorized vehicles. 
Because our measurements have focused on capturing traces induced by cyclists, we 
have performed most of the analysis on detecting these. We start by discussing the 
results for cyclist detection. Then we provide the performance results for a multi-type 
classification task with regard to cyclists and different motorized vehicle types. 

For evaluating the bicycle detection performance, we have considered a binary 
classification task with the classes bicycle and non-bicycle (idle). Table 4.2 lists the 
maximum classification results achieved for separately analyzing different channel 
parameters gathered from WLAN CSI and UWB CIR data using ANN, RF, and SVM. 
Regarding WLAN CSI, the RSSI is the dominant channel indicator leading to the high- 
est classification accuracy for all models. A possible explanation is that the WLAN 
transceiver modules evaluate multiple channel parameters to extract a significant 
measure for the link quality. Similarly, there is also a single channel parameter for 
UWB, namely the quotient of the estimated first path signal power and the channel impulse 
response power, leading to the highest classification accuracies. In particular, using 
this quotient and ANN, we could reach 100 % accuracy for detecting cyclists. 
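
For readers less familiar with the four scores listed in Table 4.2, the following sketch shows how they are commonly derived from the confusion counts of a binary classifier. It assumes the standard textbook definitions; the chapter does not state how the scores were averaged over the folds, and the counts in the usage example are made up for illustration.

```python
def binary_scores(tp, fp, fn, tn):
    """Standard scores for a binary task (positive class: bicycle)."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)  # share of predicted bicycles that are correct
    recall = tp / (tp + fn)     # share of actual bicycles that are detected
    f_score = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_score": f_score,
    }


# Hypothetical counts for one cross-validation fold (not measured data).
scores = binary_scores(tp=95, fp=5, fn=5, tn=95)
for name, value in scores.items():
    print(f"{name}: {100 * value:.2f} %")
```
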

Fig. 4.14: Normalized confusion matrices for body style vehicle classification. PC: passenger car, 
PCT: passenger car with trailer, V: van, T: truck, TT: truck with trailer, ST: semitruck, B: bus ©[2020] 
IEEE. Reprinted, with permission, from [634]. 


Tab. 4.2: Bicycle Detection: Results for WLAN CSI and UWB using ANN, RF, and SVM with a 10-fold CV. 

                     WLAN CSI                   UWB 
Model  Score         Value [%]       Param.     Value [%]       Param. 

ANN    Accuracy      99.27 ± 0.57    R (f2)     100 ± 0         FC (f0) 
       Precision     99.35 ± 0.52    R (f2)     100 ± 0         FC (f0) 
       Recall        99.24 ± 0.61    R (f2)     100 ± 0         FC (f0) 
       F-Score       99.30 ± 0.56    R (f2)     100 ± 0         FC (f0) 

RF     Accuracy      99.45 ± 0.54    R (f0)     99.83 ± 0.26    FC (f1) 
       Precision     99.48 ± 0.52    R (f0)     99.84 ± 0.25    FC (f1) 
       Recall        99.45 ± 0.51    R (f0)     99.8 ± 0.26     FC (f1) 
       F-Score       99.46 ± 0.51    R (f0)     99.83 ± 0.26    FC (f1) 

SVM    Accuracy      99.32 ± 0.51    R (f2)     99.83 ± 0.26    FC (f0) 
       Precision     99.38 ± 0.47    R (f2)     99.84 ± 0.24    FC (f0) 
       Recall        99.30 ± 0.53    R (f2)     99.82 ± 0.27    FC (f0) 
       F-Score       99.34 ± 0.50    R (f2)     99.83 ± 0.26    FC (f0) 

f: Filter size, FC: Ratio of first path signal power and CIR power, R: RSSI 


Next, we present the results for the multi-type vehicle classification applied for 
cyclists and different motorized vehicles. Specifically, we have conducted this task for 
a total of three classes, i.e., idle, bicycle, and car-like vehicles. Table 4.3 shows the 
classification results for WLAN CSI and UWB CIR data using ANN, RF, and SVM. Contrary 
to the cyclist detection task, there are multiple predominant channel indicators for 
each system. 

For WLAN CSI, the RSSI seems to be inadequate for achieving the highest accuracy. 
Instead, the subcarrier’s amplitude values of different training fields are more relevant 
for this task: LLTF when using ANN, STBC-HT-LTF for RF, and several subcarriers in 
the case of SVM. By contrast, there are two crucial parameters when using UWB: the 
amplitudes of all raw CIR accumulator data (A) and the amplitudes of accumulator 
sample 15 (A15). By comparing the resulting classification accuracies for WLAN CSI 
and UWB CIR data, we can identify a considerable performance gap of about 5 % in 
favor of WLAN CSI. However, we note that we have gathered 
the traces for cyclists and motorized vehicles in different environments with diverging 
system dimensions, impacting the transmissions between the UWB transceiver modules 
equipped with internal PCB antennas. 
Tab. 4.3: Multi-type vehicle classification: Results for WLAN CSI and UWB using ANN, RF, and SVM 
with a 10-fold cross-validation. 

                     WLAN CSI                     UWB 
Model  Score         Value [%]       Param.       Value [%]       Param. 

ANN    Accuracy      98.23 ± 0.67    L (f4)       92.38 ± 1.30    A (f2) 
       Precision     98.52 ± 0.49    L (f5)       93.53 ± 1.46    A (f2) 
       Recall        98.31 ± 0.63    L (f4)       93.30 ± 1.34    A (f2) 
       F-Score       98.39 ± 0.71    L (f3)       93.41 ± 1.38    A (f2) 

RF     Accuracy      98.67 ± 0.62    S (f0)       92.96 ± 1.67    A (f0) 
       Precision     98.83 ± 0.59    S (f0)       93.74 ± 1.74    A (f2) 
       Recall        98.84 ± 0.60    S (f1)       93.28 ± 1.79    A (f2) 
       F-Score       98.8 ± 0.61     S (f0)       93.51 ± 1.75    A (f2) 

SVM    Accuracy      96.95 ± 1.66    HSC52 (f0)   91.17 ± 2.03    A15 (f0) 
       Precision     97.86 ± 1.24    HSC52 (f0)   92.13 ± 1.85    A15 (f0) 
       Recall        97.46 ± 0.43    L (f4)       90.48 ± 2.74    A15 (f0) 
       F-Score       97.39 ± 1.44    HSC52 (f0)   91.29 ± 2.25    A15 (f0) 

A: Amplitudes of all CIR accumulator samples, A15: Amplitudes of CIR accumulator sample 15, f: 
Filter size, HSC52: HT-LTF sub-carrier 52 amplitudes, L: LLTF sub-carrier amplitudes, S: STBC-HT-LTF 
sub-carrier amplitudes 


4.2.6 Conclusion 


This section presented two variants of novel radio-based systems that exploit different 
indicators of radio channels for accurate vehicle detection and classification. While the 
original system approach leverages relatively rough attenuation patterns of wireless 
signals (RSSI fingerprints), the evolved modular system approach uses exact channel 
parameters provided by the radio technologies WLAN CSI and UWB. Compared with 
existing detection and classification solutions, the proposed system variants are privacy- 
preserving, robust against challenging weather conditions, accurate, and cost-efficient. 
We have analyzed the suitability of both systems in comprehensive measurement 
campaigns in different environments: on a rest area on a highway, a busy one-lane road in a 
rural setting, and on a cycle path. The presented results confirm the high performance 
of those system approaches for a set of differently challenging applications ranging 
from simple detection tasks of road users to a fine-grained classification of multiple 
vehicle types. 

In future work, we want to investigate the applicability of different radio 
technologies (e. g., mmWave) within our detection and classification system. Moreover, 
we will obtain additional vehicle traces in challenging urban environments (e. g., in a 
downtown setting) to evaluate and strengthen system performance. 
4.3 Green Networking and Resource Constrained Clients for Smart Cities 


Pascal Jorke, Christian Wietfeld 


Abstract: The Internet of Things (IoT) will enable a variety of new use cases by link- 
ing billions of IoT devices. Introducing new use cases each day, IoT devices will be 
found everywhere in the future. With a new generation of resource-constrained clients, 
communication networks have to face new challenges such as high communication 
ranges, small data transmission efficiency, and large scalability. With the Narrowband 
Internet of Things (NB-IoT) and enhanced Machine Type Communication (eMTC), cel- 
lular communication solutions have been adapted to these new challenges. Including 
mechanisms for larger communication ranges as well as lower power consumption, 
NB-IoT and eMTC aim to fulfil the requirements defined by new massive Machine Type 
Communication (mMTC) use cases. While performance is often only optimized on the 
lower layers, upper layers including transmission and application protocols need to 
be addressed by reducing overhead and enabling efficient small data transmissions in 
order to deliver good performance for resource-constrained clients. 


This section describes the achievements in evaluating the performance of Low-Power 
Wide-Area Network (LPWAN) solutions for resource-constrained clients in terms of 
energy efficiency, spectral efficiency, and latency. Therefore, new cellular IoT features 
for power saving and coverage extension are explained in detail, while taking the costs 
for the scalability of the networks into account. With this knowledge, a performance 
analysis of resource-constrained LPWAN clients with different coverage conditions is 
provided. 


Although both NB-IoT and eMTC use the same power-saving techniques as well 
as repetitions to extend the communication range, the analysis reveals a different 
performance in the context of data size, rate, and coupling loss. While eMTC comes with 
a 4 % better battery lifetime than NB-IoT when considering 144 dB coupling loss, the 
NB-IoT battery lifetime has 18 % better performance in 164 dB coupling loss scenarios. 
The overall analysis shows that in coverage areas with a coupling loss of 155 dB or 
less, eMTC performs better, but requires much more bandwidth. Taking the spectral 
efficiency into account, NB-IoT is in all evaluated scenarios the better choice and more 
suitable for future networks where the number of devices connected is expected to be 
close to or go beyond the network capacity. 


While communication is possible with coupling losses up to 164 dB, the results show 
that the overall performance is very limited with decreasing signal quality. Although 
both technologies are designed for extended coverage, the mobile network operators should 
continuously improve the signal quality for both uplink and downlink directions. When 
increasing the number of base stations is not feasible, alternative signal quality 
improvement solutions should be addressed. In this context, the coverage and link quality 
improvement of cellular IoT networks with multi-operator and multi-link strategies was 
evaluated as a case study, using the smart city Dortmund, Germany. The results show that the 
link quality can be improved by up to 13.6 dB, which enables shorter time-on-air for 
resource-constrained devices and thus drastically improves the energy and spectral 
efficiency. 


4.3.1 Introduction 


Waste bins with fill-level sensors, distributed environmental sensors monitoring the 
overall air quality in large cities, and beehive sensors regulating the hive temperature 
and transmitting the hive weight are just some use cases that integrate small sensor 
devices. The IoT enables countless new use cases. While some sensors have fixed power 
sources, others need to be independent of fixed power sources (e.g. smart waste bins) 
and therefore must rely on batteries, or, for an even better battery life, energy harvesting 
[262]. In large-scale scenarios, such as smart waste management, the operational costs 
need to be as low as possible and therefore the clients have to rely on a single battery for 
years, but still provide large communication ranges. In the past few years, many new 
communication solutions have addressed the requirements of low power consumption 
and wide area communication and are therefore called Low-Power Wide-Area Networks 
(LPWANs). One promising solution is Long-Range Wide-Area Network (LoRaWAN), which 
is used by many public utilities because it operates in the license-free spectrum 
and is easy to set up. Alternatives in licensed frequency bands are NB-IoT and eMTC, 
which were derived from the LTE communication technology. Both NB-IoT and eMTC 
can be deployed in existing LTE networks and therefore provide a fast and easy rollout 
in many countries. 

The next section will give a short overview of the relevant characteristics when 
considering solutions in the license-free and licensed spectrums. 


Clients in the License-Free Spectrum: LoRaWAN is an easy-to-use communication 
solution for IoT. Designed for small data transmissions, LoRaWAN uses a proprietary 
communication protocol with small overhead and short time-on-air. To further reduce 
overhead and power consumption, the channel access is based on unslotted Aloha, 
directly transmitting data when available. With no channel access mechanism and 
deployment in the license-free spectrum, collisions are inevitable [93], making LoRaWAN 
unreliable in large-scale networks. Therefore, LoRaWAN is well suited for use cases 
with minimum Quality of Service (QoS) requirements, where the loss of packets is 
acceptable. Additionally, duty-cycle limitations need to be taken into account. Duty-cycle 
limitations are used in license-free spectrum to restrict the maximum transmit time 
of a device, e.g., to 1 %, which means that a device can only transmit data for 36 s each 
hour, which limits how frequently it can transmit. More details on clients in license-free 
spectrum can be found in Section 5.1. 
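
The duty-cycle arithmetic mentioned above can be written down as a short worked example. The 1 % limit and the resulting 36 s per hour come from the text; the time-on-air of 1.5 s per packet is a purely hypothetical value chosen for illustration, and the function names are not taken from any LoRaWAN library.

```python
def max_airtime_per_hour_s(duty_cycle_percent):
    """Total transmit time allowed within one hour."""
    return 3600 * duty_cycle_percent / 100


def min_transmit_interval_s(time_on_air_s, duty_cycle_percent):
    """Smallest average spacing between packet starts that respects the limit."""
    return time_on_air_s * 100 / duty_cycle_percent


def max_packets_per_hour(time_on_air_s, duty_cycle_percent):
    """Number of whole packets that fit into the hourly airtime budget."""
    return int(max_airtime_per_hour_s(duty_cycle_percent) // time_on_air_s)


airtime = max_airtime_per_hour_s(1)              # 36 s, as stated in the text
interval = min_transmit_interval_s(1.5, 1)       # hypothetical 1.5 s packets
packets = max_packets_per_hour(1.5, 1)
print(airtime, interval, packets)
```

The longer the time-on-air of a single packet, the fewer packets fit into the hourly budget, which is why the duty cycle directly bounds the reporting rate of a device.
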


Clients in the Licensed Spectrum: When communication needs to be more reliable, 
communication solutions in licensed spectrum are the better choice. Derived from LTE, 
NB-IoT and eMTC use central scheduling for available frequency and time resources. 
Therefore, collisions on the air interface are prevented and the scalability of the net- 
work itself is mostly limited by the given frequency bandwidth. The price to pay for 
a scheduled transmission is the increased overhead for synchronization and control 
traffic, which affects the spectral and energy efficiency of resource-constrained clients. 


4.3.2 Design Objectives of Resource-Constrained IoT Clients 


With an exclusive spectrum available, NB-IoT and eMTC (often summarized as cellular 
IoT) are not limited by duty cycles and can be used with various application protocols 
and in many use cases. While eMTC relies on an existing LTE network and reuses LTE 
synchronization, NB-IoT can also be deployed as a stand-alone network [398]. Since 
NB-IoT uses only 180 kHz bandwidth (compared with 1.08 MHz in eMTC), it can also 
be deployed in guard bands, i.e., the otherwise unused bandwidth between two LTE networks 
that is usually kept free to avoid interference. 

Since clients may be distributed over a large area or even in challenging commu- 
nication environments such as basements, cellular IoT solutions have to provide an 
extended network coverage, while still enabling low power consumption. The following 
section will give a short overview on power-saving and range-extending mechanisms 
that are introduced by cellular IoT. 


4.3.2.1 Low Power Consumption 

With current cellular communication solutions, the battery lifetime is often limited 
to a maximum of several weeks. By contrast, cellular IoT has the design objective of 
10 years on a single 5 Wh battery [5]. Most IoT devices are designed to transmit small 
amounts of data on an hourly, daily, or weekly basis, which means the device is mostly 
in an idle state. Therefore, the new mechanisms for low power consumption focus on 
energy efficiency in idle mode. 
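
To see why the idle state dominates the design, it helps to convert the objective of 10 years on a 5 Wh battery into an average power budget. The following worked example does this conversion; it ignores battery self-discharge and leap days, which is a simplifying assumption of this sketch.

```python
def average_power_budget_w(capacity_wh, lifetime_years):
    """Average power a device may draw to reach the target lifetime."""
    hours = lifetime_years * 365 * 24
    return capacity_wh / hours


budget_w = average_power_budget_w(capacity_wh=5.0, lifetime_years=10)
print(f"average power budget: {budget_w * 1e6:.1f} microwatts")
```

The result is in the range of a few tens of microwatts, so even short periods of active transmission must be compensated by very long periods in a deep power-saving state.
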

Figure 4.15 depicts a typical NB-IoT transmission cycle. In the connected state, the 
device transmits its data and waits for a response. When no more data is transmitted or 
received, the device enters Extended Discontinuous Reception (eDRX). The eDRX mode 
extends the DRX (Discontinuous Reception) cycle to allow a device to remain longer in 
a power-saving state between paging occasions [398] and thus to further reduce the 
power consumption. The device remains synchronized and periodically available for 
mobile-terminated services. When the eDRX timer T3324 expires, the device will switch 
into the Power Saving Mode (PSM). 

Fig. 4.15: NB-IoT transmission cycle with eDRX and PSM. ©[2018] IEEE. Reprinted, with permission, 
from [305]. 

When using PSM, the device enters a power-saving state in which it reduces its 
power consumption to a bare minimum [398]. In PSM, the device remains registered to 
the network and maintains its connection configurations. As soon as the device leaves 
PSM, it does not need to attach to the network; rather, it reestablishes the previous 
connection, which leads to a reduced signaling overhead and optimized device power 
consumption. However, the device is unreachable for the network as long as it remains 
in PSM because it does not listen to the paging time windows. Mobile terminated 
services have to be suspended until the device reconnects to the network for mobile 
originated events. Tracking Area Updates (TAU) also trigger the device to end PSM and 
reestablish the connection to the network. While performing a TAU, the device listens 
to paging time windows and queued downlink transmissions. 


4.3.2.2 Extended Coverage 

For a comparison of signal ranges, the Maximum Coupling Loss (MCL) is often used, 
since it defines the maximum signal attenuation at which the receiver is still able to 
decode the signal. While the eMTC design objective defines an MCL of 155.7 dB [4], 
NB-IoT aims to extend the MCL to 164 dB. Figure 4.16 provides corresponding basement 
penetration ranges for the different MCL objectives. 

Besides small bandwidths, cellular IoT solutions use repetitions for an increased 
energy per bit and therefore improved signal decoding. To this end, eMTC introduces 
Coverage Enhancement (CE) Modes A and B. CE Mode A is mandatory and supports 
up to 32 repetitions while CE Mode B is optional and defines up to 2048 repetitions. 
NB-IoT also supports up to 2048 repetitions, though it does not divide the number 
of repetitions into different CE Modes, making all repetition options mandatory for all 
devices. Table 4.4 gives a detailed overview of the maximum number of repetitions for 
each NB-IoT and eMTC signal [398]. 

Fig. 4.16: Coverage Enhancement in NB-IoT and eMTC 


Tab. 4.4: Maximum number of repetitions in eMTC and NB-IoT. 

eMTC       Max. repetitions           NB-IoT     Max. repetitions 
           CE Mode A    CE Mode B 

PDSCH      32           2048          NPDSCH     2048 
MPDCCH     16           256           NPDCCH     2048 
PRACH      32           128           NPRACH     128 
PUSCH      32           2048          NPUSCH     128 
PUCCH      8            32 


As the same data is transmitted repeatedly, the application data rate decreases drastically 
with each repetition and devices consume more power compared with a transmission 
without repetitions. While new mechanisms like eDRX and PSM aim to increase the energy 
efficiency and battery lifetimes of cellular IoT devices, the extended coverage will lead 
to a significant reduction of energy efficiency. To ensure that the cellular IoT design 
objective is still achievable, both NB-IoT and eMTC must be subjected to a performance 
analysis. 
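
The trade-off described above can be made concrete with an idealized model. The sketch assumes that the receiver combines all repetitions perfectly, so that the energy per bit grows linearly with the number of repetitions (a gain of 10·log10(N) dB), while the application data rate shrinks by the same factor. Real receivers achieve less than this ideal gain, and the single-shot rate of 1000 bit/s is a hypothetical value, not a figure from the chapter.

```python
import math


def effective_rate_bps(single_shot_rate_bps, repetitions):
    """Application data rate when every transmission is sent N times."""
    return single_shot_rate_bps / repetitions


def ideal_combining_gain_db(repetitions):
    """Upper bound on the energy-per-bit gain from N combined repetitions."""
    return 10 * math.log10(repetitions)


# Repetition counts taken from Table 4.4; the base rate is hypothetical.
for n in (1, 32, 2048):
    rate = effective_rate_bps(1000.0, n)
    gain = ideal_combining_gain_db(n)
    print(f"N={n:4d}: rate {rate:8.2f} bit/s, ideal gain {gain:5.2f} dB")
```

Under this idealization, each doubling of the repetitions buys about 3 dB of additional coupling loss but halves the data rate and doubles the time on air, which is the reason the energy cost of extended coverage has to be analyzed explicitly.
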


4.3.2.3 Application Protocols for IoT 

While LPWAN solutions in a license-free spectrum such as LoRaWAN often lack 
end-to-end Internet Protocol (IP) support due to the large protocol overhead, both NB-IoT 
and eMTC are able to transmit IP traffic such as Transmission Control Protocol (TCP) and 
User Datagram Protocol (UDP) messages. Due to the reduced transmission capacity, 
the packet size and number of message sequences should be as low as possible even 
with the coverage extension mechanism. For a decent system performance, the choice 
of a suitable application protocol is essential. 

Message Queuing Telemetry Transport (MQTT) is a TCP-based IoT communications 
protocol, designed for Machine to Machine (M2M) data transmissions in low bandwidth 
environments [41]. It uses a centralized broker to which clients can publish data, while 
other clients can subscribe to data updates. In addition to transmission protection 
through TCP, MQTT introduces three Quality of Service (QoS) levels. QoS level 0 is a 
simple, low-overhead method of sending a message. The client simply connects to 
the broker and publishes the message without MQTT acknowledgement by the broker. 
The data is still acknowledged on lower layers by TCP. QoS level 1 guarantees that 
the message is transmitted successfully to the broker. The broker sends an acknowl- 
edgement back to the sender. In case of a transmission loss of either the data or the 
acknowledgement, the data is retransmitted by the sender. QoS level 2 is the highest 
level of service. It comprises a sequence of four messages between the sender and the 
broker, and a handshake to confirm that the main message has been sent and that the 
acknowledgment has been received. When the handshake is completed, both sender 
and receiver are sure that the message was received exactly once, though this approach 
creates the highest message overhead. 

MQTT for Sensor Networks (MQTT-SN) is a UDP-based, optimized version of the 
IoT communications protocol MQTT, designed specifically for efficient operation in 
large, low-power IoT sensor networks [645]. Like MQTT, it uses a centralized broker. 
Besides QoS levels 0 to 2, a new QoS level is introduced, which further reduces the 
message overhead. Publishing messages with a QoS level of -1 does not require an initial 
connection setup and broker registration; rather, it only transmits a single publish 
message with all required data. While QoS -1 comes with the lowest overhead, it does 
not provide acknowledgements and other responses from the broker. 

Constrained Application Protocol (CoAP) is a third important IoT application 
protocol. CoAP is a UDP-based, specialized web transfer protocol for use with 
resource-constrained nodes and constrained networks in the IoT [600]. The protocol is designed 
for M2M applications. Unlike MQTT and MQTT-SN, it transmits data directly between a 
client and a server. Since no connection needs to be established at first, CoAP comes 
with a low message overhead. When using confirmed transmissions, all data that is 
transmitted between server and client is confirmed by application acknowledgements. 
The data is re-sent until it is acknowledged or the maximum number of retries is reached. 

When high QoS is required, MQTT or MQTT-SN should be used for data 
transmission, since both protocols provide multiple layers of a protected transmission. When 
energy efficiency is essential and data are transmitted only from point to point, CoAP 
is the better choice. Figure 4.17 gives an overview of message overheads for these IoT 
application protocols. 

For the highest energy efficiency, the number of transmitted messages should be 
as low as possible. When no application acknowledgement is required, both CoAP 
Non-Confirmable and MQTT-SN QoS -1 are applicable. If an acknowledgement is 
needed on the application layer, CoAP is the most energy-efficient choice since both 
MQTT and MQTT-SN require a previous connection setup before transmitting and 
acknowledging user data. 
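
The comparison above can be reproduced by counting application-layer messages for a single publish. The sequences below use the message type names of the respective protocol specifications; they are an illustrative assumption and need not match the exact sequences drawn in Figure 4.17. TCP handshakes, disconnects, and retransmissions are left out, and the MQTT-SN QoS 1 case assumes that a topic registration is required.

```python
# name -> (application-layer messages, acknowledged on the application layer?)
PUBLISH_SEQUENCES = {
    "MQTT QoS 0": (["CONNECT", "CONNACK", "PUBLISH"], False),
    "MQTT QoS 1": (["CONNECT", "CONNACK", "PUBLISH", "PUBACK"], True),
    "MQTT QoS 2": (
        ["CONNECT", "CONNACK", "PUBLISH", "PUBREC", "PUBREL", "PUBCOMP"],
        True,
    ),
    "MQTT-SN QoS -1": (["PUBLISH"], False),
    "MQTT-SN QoS 1": (
        ["CONNECT", "CONNACK", "REGISTER", "REGACK", "PUBLISH", "PUBACK"],
        True,
    ),
    "CoAP Non-Confirmable": (["NON"], False),
    "CoAP Confirmable": (["CON", "ACK"], True),
}


def fewest_messages(acknowledged):
    """Return the option(s) with the fewest messages for the requested service."""
    counts = {
        name: len(messages)
        for name, (messages, acked) in PUBLISH_SEQUENCES.items()
        if acked == acknowledged
    }
    lowest = min(counts.values())
    return sorted(name for name, count in counts.items() if count == lowest)


print("unacknowledged:", fewest_messages(acknowledged=False))
print("acknowledged:  ", fewest_messages(acknowledged=True))
```

Counting messages is only a first-order proxy for energy, since message sizes and lower-layer overhead differ between the protocols, but it reflects the ranking given in the text.

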
Fig. 4.17: Comparison of message sequences for different IoT application protocols. 


4.3.3 Performance Analysis of Resource-Constrained LPWAN Clients 


While features like eDRX and PSM aim for lower power consumption and thus longer 
battery lifetime, coverage enhancement causes a much higher power consumption 
for UEs by introducing longer transmission and reception intervals. Both features 
are required to meet the design targets of a 10-year battery lifetime, a 164 dB maximum 
coupling loss, and a maximum latency of 10 seconds for a single data transmission. In 
this section, performance studies of cellular IoT solutions are discussed. 


4.3.3.1 Power Consumption Analysis of NB-IoT and eMTC in Challenging Smart-City 
Environments 
In Section 4.3.2.1, two new power-saving states for NB-IoT and eMTC devices are intro- 
duced. Both states provide a reduced power consumption compared with current GSM 
or LTE devices. Besides PSM and eDRX, devices can enter three additional power states: 
Connected, Tail, and TAU. In the Connected power state, random access, data transmis- 
sion, and reception are performed. After data transmission and reception, the device 
remains for a predefined time (the data inactivity timer) in a Tail state, in which 
it remains connected on the RRC communication layer for additional data exchange. 
Then, it switches to the eDRX power state, where the device wakes up only for paging 
occasions. Finally, the device reduces its power consumption to a bare minimum in 
the PSM state. It periodically wakes up for TAU and checks if downlink transmissions 
are queued, since these messages can’t be received while in PSM. Figure 4.18 gives an 
overview of the state machine that is used to determine the energy consumption of UEs. 
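
A state-machine-based energy estimate of this kind can be sketched as follows. This is not the model from [305]: the structure (one reporting period consisting of the active states, with PSM filling the remaining time) is a simplification, TAU is treated as one event per period, and all power and duration values are hypothetical placeholders rather than measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phase:
    name: str
    power_w: float     # average power drawn in this state
    duration_s: float  # time spent in this state per reporting period


def battery_lifetime_years(phases, psm_power_w, period_s, capacity_wh=5.0):
    """Estimate the lifetime when PSM fills the rest of each reporting period."""
    active_time_s = sum(p.duration_s for p in phases)
    if active_time_s > period_s:
        raise ValueError("active phases exceed the reporting period")
    energy_j = sum(p.power_w * p.duration_s for p in phases)
    energy_j += psm_power_w * (period_s - active_time_s)
    periods = capacity_wh * 3600.0 / energy_j  # 1 Wh = 3600 J
    return periods * period_s / (365 * 24 * 3600)


# Hypothetical values for one uplink report every 24 hours.
phases = [
    Phase("Connected", power_w=0.5, duration_s=2.0),
    Phase("Tail", power_w=0.05, duration_s=20.0),
    Phase("eDRX", power_w=0.005, duration_s=60.0),
    Phase("TAU", power_w=0.2, duration_s=1.0),
]
lifetime = battery_lifetime_years(phases, psm_power_w=10e-6, period_s=24 * 3600)
print(f"estimated battery lifetime: {lifetime:.1f} years")
```

In such a model, a larger number of repetitions shows up as a longer Connected phase, which directly shortens the estimated lifetime.

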

Fig. 4.18: State machine for power consumption evaluation. ©[2018] IEEE. Reprinted, with 
permission, from [305]. 


To compare different cellular IoT solutions and assess if all IoT requirements can 
be fulfilled by NB-IoT and eMTC, the authors in [305] provide a performance analysis of 
NB-IoT and eMTC latency, data rate, battery lifetime, and spectral efficiency for three 
different coverage classes. The results are shown in Figure 4.19. 
Fig. 4.19: Comparison of NB-IoT and eMTC devices for 84 bytes of acknowledged uplink data every 
24 hours in different coverage conditions. Note that the axis scales vary between the three figures. 
©[2018] IEEE. Reprinted, with permission, from [305]. 


The results show that eMTC performs slightly better than NB-IoT at coupling losses of 
144 dB and 154 dB. While the eMTC performance gain is rather small, it uses 6 times 
more bandwidth than NB-IoT, making transmissions less spectrally efficient. When it 
comes to cell edges such as basements, where the coupling loss can increase to up 
to 164 dB, NB-IoT clearly outperforms eMTC. While both cellular technologies use the 
same power saving mechanisms, NB-IoT needs fewer repetitions to transmit data, which 
reduces the time on air and therefore extends the time in PSM between transmissions. 
Although eMTC performs better than NB-IoT in good coverage conditions, the difference 
in data rate, latency, and battery lifetime performance is rather small. When it comes 
to poor coverage conditions as well as spectral efficiency, NB-IoT is recommended over 
eMTC. 


4.3.3.2 Coverage and Link Quality Improvement of Cellular IoT Networks with 
Multi-Operator and Multi-Link Strategies 

Section 4.3.3.1 has given an overview of the performance of cellular IoT technologies in 
different coverage scenarios. With increasing signal attenuation the overall performance 
decreases drastically. Therefore, the signal quality should always be as good as possible. 
Instead of expanding the networks by installing new base stations, multi-operator 
strategies (such as National Roaming) can provide better coverage and link quality for 
LTE and cellular IoT technologies, by allowing cellular devices to use networks from 
different Mobile Network Operators (MNOs) as well. 

The authors in [306] evaluated the potential of coverage and link quality improve- 
ment in terms of multi-operator strategies in the Smart City Dortmund as a case study. 
By extracting the number and locations of all cellular base stations in Dortmund and 
applying empirical path loss models for urban environments, we performed a compre- 
hensive coverage analysis. The results are given in Figure 4.20. 
Fig. 4.20: Results of the coverage analysis for outdoor, indoor and deep indoor scenarios and 
different cellular communication technologies in an urban environment. ©[2019] IEEE. Reprinted, with 
permission, from [306]. 


While all MNOs can provide full LTE and NB-IoT coverage outdoors, the indoor and 
basement (deep indoor) coverage from LTE decreases to 66 % and 42 %, respectively. 
In multi-operator scenarios, the deep indoor coverage increases by up to 40 %, which 
makes multi-operator deployment highly recommended in LTE scenarios. 

Due to its extended coverage, NB-IoT can provide full coverage in all scenarios 
when using a maximum coupling loss of 164 dB. If the maximum coupling loss is limited 
to 154 dB for a better performance (see Section 4.3.3.1), NB-IoT can still provide full 
coverage when multi-operator strategies are used. 

Even in scenarios with full coverage, multi-operator strategies can be worthwhile 
because they improve the average link quality by up to 13.6 dB (Figure 4.21). According 
to Figure 4.19, a 10 dB improvement of link quality can already increase the battery 
lifetime of an NB-IoT device from 4 to 18 years and decrease the latency from 5 s to 0.8 s. Therefore, 
multi-operator strategies are highly recommended for link-quality improvement. 
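
The principle behind the multi-operator gain can be illustrated with a small sketch: at every location, the device uses the operator with the lowest coupling loss. The coupling loss values below are invented for illustration and are not results from [306]; averaging values in dB is a further simplification of this example.

```python
from statistics import mean


def best_operator_losses(losses_by_mno):
    """Per-location coupling loss when the best operator is always selected."""
    return [min(values) for values in zip(*losses_by_mno.values())]


def coverage(losses_db, mcl_db):
    """Share of locations whose coupling loss does not exceed the MCL."""
    return sum(loss <= mcl_db for loss in losses_db) / len(losses_db)


# Hypothetical coupling losses in dB at four locations for three operators.
losses_by_mno = {
    "MNO 1": [140, 158, 166, 150],
    "MNO 2": [150, 149, 160, 163],
    "MNO 3": [145, 162, 155, 147],
}
combined = best_operator_losses(losses_by_mno)

for name, losses in losses_by_mno.items():
    gain_db = mean(losses) - mean(combined)
    print(f"{name}: coverage at 154 dB {coverage(losses, 154):.2f}, "
          f"average gain with roaming {gain_db:.2f} dB")
print(f"all operators: coverage at 154 dB {coverage(combined, 154):.2f}")
```

Because the combined coupling loss at each location is never worse than that of any single operator, both the coverage at a given MCL and the average link quality can only improve or stay the same.
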


[Figure legend: average signal power gain for MNO 1&2, MNO 1&3, MNO 2&3, and MNO 1-3 (National Roaming)] 

Fig. 4.21: Results of the coupling loss reduction potential for different coupling loss scenarios and 
cellular communication technologies. ©[2019] IEEE. Reprinted, with permission, from [306]. 


4.3.4 Conclusion 


Energy efficiency is an important factor in the IoT. Many use cases rely on sensors 
that can last at least 10 years on a single battery. With new communication technolo- 
gies such as NB-IoT and eMTC, cellular solutions respond to the new challenges that 
are introduced by the IoT. But optimized communication technologies alone are not 
sufficient. Energy efficiency needs to be addressed on all layers, from the choice of 
an appropriate application protocol that produces as low an overhead as possible to 
link-quality improvement strategies that obviate a high number of repetitions on the 
air interface. In view of the results of the performance and coverage analysis, a good 
device placement is of great importance for resource-constrained devices. In the future, 
extensive measurements of latency and energy consumption can be used to derive 
an ML-based predictive model for latency and energy performance by using passive 
measurement parameters such as signal strength and signal quality. Additionally, the 
number and size of transmitted messages can be reduced by ML-based model-predictive 
communication, as introduced in [30]. If green networking and resource constraints are 
taken into account from the very beginning of an application’s design, the performance 
of the system from a user and network perspective can be significantly increased. 
Abstract: Vehicular communication is used to exchange safety-related status 
information, enable efficiency-oriented mobility planning, and share other user data, such 
as video streams, among a locally restricted area and the vehicles and nearby infras- 
tructure within it. Without the need for cellular coverage, the particular devices, or 
agents, organize themselves in a distributed fashion without a central coordination 
unit. This ability not only allows the realization of Intelligent Transportation Systems 
(ITSs) that will have a major impact on the cities of the future, but it also enables 
spontaneously deployed networks, that cover the task of on-demand network pro- 
visioning for events. A well-known example is the support of rescue units that can 
utilize Unmanned Aerial Vehicles (UAVs) for remote sensing and delegate exploration 
tasks and reduce the risk of endangering human personnel. These applications all 
have high requirements and need a robust and reliable communication behavior. As 
Mobile Ad-hoc NETworks (MANETs) are not managed centrally, data needs to be routed 
efficiently from the sender to the receiver, whereas link losses and unnecessary hops 
need to be avoided. Established protocols rely on simple distance measurements and 
try to minimize the sender-to-receiver distance. In challenging networks, these simple 
approaches can not cope with the complexity of the task. Therefore, more advanced 
techniques integrate more information and provide a higher grade of generalization. 
Comprehensive simulations have shown that the utilization of cross-layer knowledge 
and the prediction of future network states enable reliable and robust reinforcement 
learning-based routing algorithms, that achieve high performances under different 
conditions. Moreover, this technology outperforms established routing protocols by up 
to 51% in all considered studies. 


4.4.1 Introduction: Direct Agent Communication in Ad-Hoc Networks 


Self-organizing networks, where the nodes communicate directly, are described as 
ad-hoc networks. Routes are built not only to directly reachable neighbors, but also to 
more distant nodes with which communication is only possible via (multiple) 
intermediate hops. This way, all agents create a mesh. Here, the sub-class of Mobile 
Ad-hoc NETworks (MANETs) is of particular interest, as they explicitly allow for the 
movement of nodes, which leads to frequent changes in the network topology and 
thus requires suitable solutions for providing robust and reliable communication. 
As MANETs are infrastructure-less, efficient routing protocols are required to balance 
the trade-off between overhead and the provision of valid routes. Whereas too much 
traffic used for coordination would reduce the overall throughput, outdated information 
leads to the loss of packets. The nodes’ mobility is a primary factor of impact on the 
network topology. MANETs can have a hybrid composition of highly mobile nodes, such as 
UAVs and cars, and nodes with very low or no mobility, such as 
pedestrians and Road-Side Units (RSUs). Besides mobility, varying channel conditions 
due to multi-path propagation (especially in urban environments), signal attenuation, 
and shadowing have a significant impact on the nodes’ reachability and can harm 
the end-to-end routing performance. As this variety of influences creates challenging 
conditions for MANETs, established routing protocols that rely on comparatively simple 
metrics, such as distance vectors represented by hop counts, are not able to fulfill the 
requirement for robust and reliable communication. This motivates the integration of 
further information and the enhancement of routing metrics in order to 
assess occurring network situations adequately and increase the overall performance. 


4.4.2 Related Work: Evolution of Mobility-Predictive Ad-Hoc Routing Protocols 


A classification of the developments of ad-hoc routing protocols is given in Figure 4.22. 
The proposed protocols originate from the Better Approach To Mobile Ad-hoc Networking 
(B.A.T.M.A.N.) [472] project located in the Freifunk community. B.A.T.M.A.N. III was 
originally developed to tackle scalability problems in the established routing protocol 
Optimized Link State Routing (OLSR) by distributing the topology knowledge among 
local entities, thus obviating the need to calculate the full network graph on every 
node, which is an expensive task and especially unsuitable for resource-constrained 
systems. Subsequent B.A.T.M.A.N. versions relocate their point of operation from 
Internet Protocol (IP)-based routing in layer 3 to layer 2 in order to provide a network 
protocol-independent routing approach and have a more direct impact on the packets. 
As B.A.T.M.A.N. is intended for real-world use, kernel implementations are available, 
but simulation models for scientific research are often missing. Therefore, a simulation 
model of the B.A.T.M.A.N. V protocol version for the well-known discrete event simulator 
Objective Modular NETwork testbed in C++ (OMNeT++) [701] has been developed 
and validated by field experiments in [627]. However, as the overall goal of the 
Freifunk community is to provide mesh-based Internet access within cities, B.A.T.M.A.N. 
implementations contain overhead to fulfill this task, such as Host Network 
Announcements (HNAs), which is separate from the actual routing process. While the first 
extension, B.A.T.Mobile [623], forks from the main branch and retains those measures, the 
subsequent protocol Predictive Ad-hoc Routing fueled by Reinforcement learning and 
Trajectory knowledge (PARRoT) [635] omits the additional overhead for network 
provisioning. Thus, PARRoT, which was additionally influenced by reinforcement 
learning-based routing, concentrates on IP-based routing for an assessment of the 
concepts. The decoupled development of novel approaches is intended to allow the newer 
routing mechanisms to be merged back into the latest B.A.T.M.A.N. version. The latter 
protocols, B.A.T.Mobile and PARRoT, follow a mobility-predictive routing approach and are 
explained in more detail in the following. The most recent development is 
Context-Adaptive PARRoT (CA-PARRoT) [586], which can be regarded as an extension to 
PARRoT and follows a hybrid machine learning approach. 

Fig. 4.22: Evolution graph of routing protocols. 

Routing protocols can be classified into reactive and proactive protocols. The former 
initiate a route-building process on demand. Well-known examples are Ad-hoc On-demand 
Distance Vector (AODV) and DYnamic MANET On-demand routing protocol 
(DYMO). The latter maintain routing tables that are used for lookups when necessary 
and are updated periodically. Destination-Sequenced Distance Vector (DSDV) and 
OLSR are widely known proactive protocols. Greedy Perimeter Stateless Routing in 
wireless networks (GPSR) is a geo-based routing approach that considers mobility and 
communication as dependent tasks. Route building is done by minimizing the geographic 
distance between the current sender and the destination node with each hop. Extensive 
summaries of existing protocols are found in [486] and [470]. An empirical analysis of 
the protocols used in vehicular networks is provided in [119]. 
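
The greedy part of geo-based routing described above can be illustrated with a minimal sketch. It only covers the hop-wise minimization of the distance to the destination; the perimeter mode that GPSR uses when no neighbor is closer is not modeled, and the function name and data layout are assumptions made for this illustration.

```python
import math


def greedy_next_hop(current, neighbors, destination):
    """Return the id of the neighbor closest to the destination.

    current, destination: coordinate tuples of equal dimension.
    neighbors: dict mapping a neighbor id to its coordinate tuple.
    Returns None if no neighbor is closer to the destination than the
    current node itself (a local minimum of the greedy strategy).
    """
    best_id = None
    best_dist = math.dist(current, destination)
    for node_id, position in neighbors.items():
        dist = math.dist(position, destination)
        if dist < best_dist:
            best_id, best_dist = node_id, dist
    return best_id
```

A node at (0, 0) with neighbors at (1, 0) and (0, 1) would forward a packet for (3, 0) to the first neighbor, because that hop reduces the remaining distance the most.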

Recent developments in the machine learning field have also had an impact on 
routing algorithms. The authors of [671] use a centralized Artificial Neural Network 
(ANN) to enable a Software Defined Network (SDN) approach for latency minimization in 
vehicular networks. However, reinforcement learning-based approaches allow routing 
entities to make autonomous decisions in a decentralized manner. In [99] Q-routing is 
proposed as an integration of autonomous routing decisions based on learned latency. 
The authors of [481] extend this with mobility-based metrics and take into account the 
swarm coherence of agents. A summary of channel and propagation models is given 
in [704]. The authors of [60] present an approach of learning from stochastic channel 
parameters. Intelligent routing algorithms and developments for ad-hoc networks might 
also have an impact on future network generations [17]. 


4.4.3 Approaches: Enhance Routing by Prediction and Machine Learning 


In this subsection, the routing approaches of B.A.T.Mobile, PARRoT, and CA-PARRoT 
are presented. B.A.T.Mobile leverages cross-layer knowledge from the mobility domain 
to predict the agents’ future trajectory and integrate this information in the routing 
decision. Following the anticipatory mobile networking paradigm [107], an overall 
increase in terms of robustness and reliability was achieved. PARRoT, as a follow-up 
protocol, takes the mobility prediction approach and integrates it into a reinforcement 
learning process that utilizes abstract metrics [481] and represents the routing process 
[99]. CA-PARRoT extends this with a mechanism to compensate short-term influences 
and introduces a hybrid machine learning approach where the routing decision still 
relies on reinforcement learning, but a machine learning component is used to classify 
a Radio Environment Prototype (REP) and select an appropriate parametrization to 
achieve the best possible end-to-end performance. 


4.4.3.1 B.A.T.Mobile: Leveraging Cross-Layer Knowledge 
B.A.T.Mobile introduces a multi-factorial mobility prediction that classifies available 
information from the mobility control into three levels of assumed accuracy. An iterative 
prediction process is performed that always chooses the most accurate prediction 
method for the current step. A prediction width T is divided into N iteration steps to 
predict the future position p(t + T). The considered types of information, named in 
descending order of their accuracy, are: 

- The steering vector describes the current heading to the position of the next 
  iteration step. This information is only available in the very first iteration. 

- The currently targeted waypoint w(t). If the agent’s position is within a specific range of 
  this waypoint, it is considered reached, and the next target is used for the remaining 
  prediction process. 

- The history of Ne previous positions. Every iteration step appends a new estimate 
  to this list. It is used as a fallback solution to enable a mobility prediction even if 
  no advanced information is available. 


The prediction result p(t + T) is then integrated into the periodically broadcast routing 
messages of the underlying B.A.T.M.A.N. routing protocol and replaces the former 
Transmission Quality (TQ) metric for routing decisions. 
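
The iterative selection of the most accurate available information can be sketched as follows. This is an illustrative sketch rather than the B.A.T.Mobile implementation: it assumes that the steering vector is given as a velocity vector, that the agent approaches waypoints at a known constant speed, and that the position history is sampled at the iteration interval T/N. All names are hypothetical.

```python
import math


def predict_position(pos, horizon, steps, speed, steering=None,
                     waypoints=(), history=(), reach_radius=1.0):
    """Iteratively predict the position after `horizon` seconds.

    In every iteration, the most accurate information that is still
    available is used: steering vector (first iteration only), then the
    targeted waypoint, then extrapolation from the position history.
    """
    dt = horizon / steps
    hist = list(history) + [tuple(pos)]
    targets = list(waypoints)
    for i in range(steps):
        cur = hist[-1]
        # A waypoint within reach is considered reached; use the next one.
        while targets and math.dist(cur, targets[0]) <= reach_radius:
            targets.pop(0)
        if i == 0 and steering is not None:
            step = tuple(s * dt for s in steering)
        elif targets:
            dist = math.dist(cur, targets[0])
            scale = min(speed * dt, dist) / dist
            step = tuple((t - c) * scale for c, t in zip(cur, targets[0]))
        elif len(hist) >= 2:
            # Fallback: mean displacement per step over the history.
            n = len(hist) - 1
            step = tuple((b - a) / n for a, b in zip(hist[0], hist[-1]))
        else:
            step = tuple(0.0 for _ in cur)
        # Every iteration appends its estimate to the history.
        hist.append(tuple(c + s for c, s in zip(cur, step)))
    return hist[-1]
```

Because every estimate is appended to the history, the fallback extrapolation remains usable in later iterations even when the steering vector and all waypoints have been consumed.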


4.4.3.2 PARRoT: Transition to an Autonomous Routing Process 

PARRoT inherits the mobility prediction of B.A.T.Mobile. The routing metric is not solely 
built on relative mobility; rather, it is learned by a Q-learning process. Agents share 
their current and predicted positions, which are then used by the receiving agent to 
reconstruct its neighbor’s trajectory. This trajectory is then related to the 
agent’s own trajectory, and the future availability of a link between these two agents is 
expressed by the metric Φ_LET(d, j), representing the Link Expiry Time (LET). 

Fig. 4.23: Interconnection of the routing protocol and the mobility domain for multi-factorial mobility 
predictions. Reprinted from [585]. 

Every agent evaluates its own cohesion by comparing its current set of neighbors with a 
previous set. This produces the metric Φ_coh(j), which is shared through the routing 
messages. The reinforcement learning process is performed by reverse route building. An 
originator creates a routing message, referred to as a chirp in PARRoT, sets the reward 
value V to 1.0, and broadcasts it. Recipient agents carry out the learning process, 
considering the originator, which will be the destination d in reverse route building, 
and the adjacent agent j, from whom the message was received, and update their knowledge 
according to 


$$Q(d, j) = Q(d, j) + \alpha \left[ \gamma_0 \cdot \Phi_{\mathrm{LET}}(d, j) \cdot \Phi_{\mathrm{coh}}(j) \cdot V_j - Q(d, j) \right] \qquad (4.11)$$


Here, the learning rate α controls how strongly new routing messages that deliver 
the reward V_j affect the learned knowledge. The basic discount factor γ_0 guarantees 
a path degradation. This is of particular interest when all other metrics become 1.0, 
e.g., in a static network. This measure then forces the shared reward to decrease 
and prevents routing loops. The agents maintain a Q-table, where a quality indicator 
Q(d, j) exists for every destination/gateway pair over which a chirp has been received. 
This table is then utilized to feed the conventional routing table, which enables the 
packet forwarding process. 
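
The update rule of Equation (4.11) and the derivation of a next hop from the Q-table can be sketched as follows. The class layout, the default values of the learning rate and the basic discount factor, and the choice of the neighbor with the highest Q-value as next hop are assumptions for illustration; the link expiry and cohesion metrics are passed in as already computed values.

```python
from collections import defaultdict


class QTable:
    """Minimal Q-table keyed by (destination, neighbor) pairs."""

    def __init__(self, alpha=0.5, gamma0=0.8):
        self.alpha = alpha      # learning rate (hypothetical default)
        self.gamma0 = gamma0    # basic discount factor (hypothetical default)
        self.q = defaultdict(float)

    def update(self, dest, neighbor, phi_let, phi_coh, reward):
        """Apply Q <- Q + alpha * (gamma0 * phi_let * phi_coh * V - Q)."""
        old = self.q[(dest, neighbor)]
        target = self.gamma0 * phi_let * phi_coh * reward
        self.q[(dest, neighbor)] = old + self.alpha * (target - old)
        return self.q[(dest, neighbor)]

    def best_next_hop(self, dest):
        """Return the neighbor with the highest Q-value for dest, if any."""
        candidates = {j: v for (d, j), v in self.q.items() if d == dest}
        return max(candidates, key=candidates.get) if candidates else None
```

With all metrics equal to 1.0, repeated chirps drive Q(d, j) toward the basic discount factor rather than toward 1.0, which reflects the path degradation described above.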

Figure 4.24 shows the datagram of a PARRoT routing message that contains identi- 
fication data, age information, and position information. 


[Figure: byte layout of the routing message with byte offsets 0 to 15 per row. Fields shown: Originator, SEQ, TTL, Cohesion Φ_coh, Reward V, followed by position fields p.x, p.y, p.z, p.x.]


Fig. 4.24: Byte datagram of a PARRoT routing message with a total length of 40 bytes. 

Fig. 4.25: Timer-based update procedure to prevent short-term effects from having too much impact 
on learned knowledge. ©[2021] IEEE. Reprinted, with permission, from [586]. 


4.4.3.3 CA-PARRoT: Introducing a Machine Learning-Enabled Context Adaption 
CA-PARRoT proposes further enhancements to PARRoT in order to increase robustness 
under different conditions and reduce the pre-configuration effort. As known from 
comprehensive simulations, the performance of PARRoT suffers in time-variant 
environments, as short-term effects harm the accuracy of learned knowledge. B.A.T.Mobile, 
in turn, proves to be more robust because it buffers incoming information to smooth 
the update process. Thus, a timer-based update procedure is also introduced to 
CA-PARRoT, as shown in Figure 4.25. It is divided into four phases. In phase 1, an 
incoming value is pushed to a metric buffer. The same value leads to an immediate 
Q-update in phase 2, yielding temporary knowledge in phase 3 that is used to decide 
on the forwarding of the current routing message. After an elapsed time Δt_u, the best 
candidate is read from the metric buffer and triggers a Q-update whose result is persisted in 
the stored knowledge and is also used to update the routing tables in which the packet 
forwarding is managed (phase 4). 
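
The four phases can be sketched as follows. This is a simplified illustration under stated assumptions: the incoming value is taken to be the already discounted update target, the "best candidate" is assumed to be the largest buffered value, and the timer is represented by an explicit method call. All names are hypothetical.

```python
class BufferedQLearner:
    """Sketch of a timer-based Q-update with a metric buffer."""

    def __init__(self, alpha=0.5):
        self.alpha = alpha   # learning rate (hypothetical default)
        self.stored = {}     # persisted knowledge, updated only on timer expiry
        self.buffer = {}     # metric buffer per (destination, neighbor) key

    def on_chirp(self, key, target):
        """Phases 1 to 3: buffer the value and return temporary knowledge."""
        self.buffer.setdefault(key, []).append(target)   # phase 1
        old = self.stored.get(key, 0.0)
        # Phases 2 and 3: the temporary Q-value is used for the forwarding
        # decision only and does not alter the stored knowledge.
        return old + self.alpha * (target - old)

    def on_timer(self):
        """Phase 4: persist a Q-update with the best buffered candidate."""
        for key, values in self.buffer.items():
            best = max(values)
            old = self.stored.get(key, 0.0)
            self.stored[key] = old + self.alpha * (best - old)
        self.buffer.clear()
        return dict(self.stored)
```

Since the stored knowledge changes only once per timer interval, a single outlier among several chirps received within that interval does not overwrite the learned value.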

Besides the refactored update process, a machine learning component is introduced 
that selects a parametrization after classifying the current environment. For 
this purpose, different radio environment prototypes are considered, for which a 
parameter optimization is carried out in advance, and the best parametrization is 
provided to the routing protocol. Figure 4.26 shows the adaption approach, which 
starts with an initial monitoring of the signal strengths of incoming routing messages 
within an evaluation window. Statistical features are extracted and used to classify 
a prototype and select its parameters, such as the learning rate α, the basic discount 
factor γ_0, and the newly introduced A and w, which are used to exponentially weight partial 
Q-learning metrics. 

Random Forests (RFs), Support Vector Machines (SVMs), and Artificial Neural 
Networks (ANNs) are provided as classification methods. Figure 4.27 shows the classifi- 
cation accuracy that is obtained through a 10-fold cross-validation with the Lightweight 
Machine Learning for IoT Systems with Resource Limitations (LIMITS) [633] framework, 
which also enables a model export. 
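
The adaption pipeline of monitoring, feature extraction, classification, and parameter selection can be illustrated with a deliberately simple stand-in. The sketch below uses a nearest-centroid rule instead of the RF, SVM, or ANN classifiers named above, and the prototype names, centroids, and parameter values are invented placeholders, not results from the source.

```python
import math
import statistics

# Hypothetical prototypes: (mean RSS in dBm, standard deviation in dB).
PROTOTYPES = {"rural": (-70.0, 2.0), "urban": (-80.0, 8.0)}

# Hypothetical pre-optimized parameter sets per prototype.
PARAMETERS = {
    "rural": {"alpha": 0.5, "gamma0": 0.8},
    "urban": {"alpha": 0.3, "gamma0": 0.7},
}


def extract_features(rss_window):
    """Extract statistical features from the RSS values of one window."""
    return (statistics.fmean(rss_window), statistics.pstdev(rss_window))


def classify(features):
    """Nearest-centroid stand-in for the trained classifier."""
    return min(PROTOTYPES, key=lambda name: math.dist(features, PROTOTYPES[name]))


def select_parameters(rss_window):
    """Classify the radio environment and return its parameter set."""
    prototype = classify(extract_features(rss_window))
    return prototype, PARAMETERS[prototype]
```

In a real deployment, `classify` would be replaced by the exported model, while the surrounding steps of windowing and parameter lookup would keep the same shape.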

Fig. 4.26: Context-adaption approach for classifying radio environment prototypes (REPs) and 
selecting parameters accordingly. ©[2021] IEEE. Reprinted, with permission, from [586]. 

Fig. 4.27: Cross-validation accuracy for the classification of radio environment prototypes. ©[2021] 
IEEE. Reprinted, with permission, from [586]. 

Fig. 4.28: (a) shows the trajectories of 5 agents, following a random waypoint mobility pattern in the 
three-dimensional reference playground. (b) shows an application-driven mobility, where UAVs fly 
over clusters of ground vehicles. ©[2021] IEEE. Reprinted, with permission, from [635]. 


4.4.4 Performance Evaluation of MANET Routing Protocols 


In order to compare the performance of different routing protocols, 
a reproducible scenario setup is required that provides a comparable frame for all 
candidates and makes the overall performance depend on the evaluated protocol. 
As Key Performance Indicators (KPIs), the Packet Delivery Ratio (PDR) and the mean 
latency are considered as end-to-end metrics. 
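
Both KPIs can be computed from per-packet send and receive records, as in the following minimal sketch. The data layout (packet identifiers mapped to timestamps in seconds) is an assumption for illustration.

```python
def packet_delivery_ratio(sent_ids, received_ids):
    """Fraction of sent packets that arrived at the destination."""
    sent = set(sent_ids)
    if not sent:
        return 0.0
    return len(sent & set(received_ids)) / len(sent)


def mean_latency(send_times, receive_times):
    """Mean end-to-end delay over all delivered packets.

    send_times, receive_times: dicts mapping packet id to a timestamp.
    Returns None if no packet was delivered.
    """
    delays = [receive_times[p] - send_times[p]
              for p in receive_times if p in send_times]
    return sum(delays) / len(delays) if delays else None
```

Note that the mean latency is defined over delivered packets only, so it should always be read together with the PDR.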

A reference scenario is constructed in which a video stream with a Constant Bit Rate 
(CBR) of 2 Mbit/s is established between two agents in a network of ten agents. Besides 
the communication aspect, the agent mobility has a major impact on the evolution and 
characteristics of the network topology. Figure 4.28 (a) shows a generic Random WayPoint 
(RWP) mobility in the three-dimensional playground of the reference scenario. The agents 
move with a constant speed of 50 km/h and immediately choose their next target when they 
reach their current one. RWP mobility is considered as a benchmark mobility. Figure 4.28 (b) 
shows an example of an application-driven mobility, where UAVs hover over clusters of 
ground vehicles to extend their communication capabilities [624]. This mobility pattern 
involves hybrid types of vehicles that possess inherently different characteristics, which 
is a particular challenge for efficient routing protocols. 
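
The RWP behavior described above (constant speed, no pause at waypoints) can be sketched as follows. The playground is assumed to be an axis-aligned box starting at the origin, and the class and parameter names are hypothetical.

```python
import math
import random


class RandomWaypointAgent:
    """Agent that moves at constant speed toward uniformly drawn targets."""

    def __init__(self, bounds, speed, rng):
        self.bounds = bounds    # playground extent per axis in meters
        self.speed = speed      # constant speed in m/s
        self.rng = rng          # random.Random instance for reproducibility
        self.pos = self._random_point()
        self.target = self._random_point()

    def _random_point(self):
        return tuple(self.rng.uniform(0.0, b) for b in self.bounds)

    def step(self, dt):
        """Advance the agent by dt seconds and return its new position."""
        remaining = self.speed * dt
        while remaining > 0.0:
            dist = math.dist(self.pos, self.target)
            if dist <= remaining:
                # Target reached: continue immediately toward a new one.
                self.pos = self.target
                remaining -= dist
                self.target = self._random_point()
            else:
                f = remaining / dist
                self.pos = tuple(p + (t - p) * f
                                 for p, t in zip(self.pos, self.target))
                remaining = 0.0
        return self.pos
```

For the reference speed of 50 km/h, the agent would be created with a speed of 50 / 3.6 m/s; passing a seeded `random.Random` instance keeps the trajectories reproducible across protocol runs.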

Another crucial aspect is the choice of a channel model. In rural environments, 
a Line-Of-Sight (LOS) connection between two agents is usually given. With a lack of 
objects in the playground, multipath propagation can be neglected, and the free-space 
path loss L grows with a power of the distance d, i.e., L ∝ d^n, where 
n is the attenuation coefficient. This simple but generic model is utilized in the reference 
scenario for performance evaluation. For other environments, more complex path loss 
models need to be considered. In urban areas, the impact of objects leads to a higher 
importance of multipath propagation, as the LOS path is superposed by reflected signal 
paths. Therefore, the Nakagami model introduces a stochastic impact on the Received 
Signal Strength (RSS) in addition to the free-space path loss. The empirical analysis 
of [119] points out commonly used channel models and well-established reference 
protocols in vehicular research, with which the novel protocols are compared. 
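
Both channel models can be sketched in a few lines. The power-law path loss is expressed in decibels, and the Nakagami impact is modeled through the standard property that the received power under Nakagami-m fading is gamma distributed with shape m and the deterministic power as its mean. The reference loss of 40 dB at 1 m and all function names are assumptions for illustration.

```python
import math
import random


def path_loss_db(distance_m, exponent, ref_loss_db=40.0, ref_distance_m=1.0):
    """Power-law path loss L ~ d^n, expressed in dB."""
    return ref_loss_db + 10.0 * exponent * math.log10(distance_m / ref_distance_m)


def nakagami_rss_dbm(tx_power_dbm, distance_m, exponent, m, rng,
                     ref_loss_db=40.0):
    """Draw one RSS sample with Nakagami-m fading on top of the path loss."""
    mean_dbm = tx_power_dbm - path_loss_db(distance_m, exponent, ref_loss_db)
    mean_mw = 10.0 ** (mean_dbm / 10.0)
    # Gamma-distributed power with shape m and mean mean_mw.
    faded_mw = rng.gammavariate(m, mean_mw / m)
    return 10.0 * math.log10(faded_mw)
```

Larger values of m concentrate the samples around the deterministic value, whereas m = 1 corresponds to Rayleigh fading.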


4.4.5 Results 


In this subsection, a comprehensive simulative performance analysis is presented. 
First, a scalability analysis is carried out to assess the behavior of routing protocols 
for different types of networks. Afterward, the end-to-end performance is evaluated 
in scenario studies that represent potential fields of application. To assess an upper 
bound, an optimal PDR for free-space conditions is introduced, which is obtained from a 
post-processed analysis of the agents’ positions and the theoretical availability of routes. 
Thus, it is considered a mobility-constrained upper bound. 
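
One way to obtain such a bound in post-processing is to check, for every recorded time step, whether any multi-hop route between source and destination exists. The sketch below assumes a fixed communication range as a simplification of the free-space conditions and one packet per position snapshot; the function names and data layout are hypothetical.

```python
import math
from collections import deque


def route_exists(positions, src, dst, comm_range):
    """Breadth-first search over all links shorter than comm_range."""
    visited = {src}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return True
        for other, pos in positions.items():
            if other not in visited and math.dist(positions[node], pos) <= comm_range:
                visited.add(other)
                queue.append(other)
    return False


def optimal_pdr(snapshots, src, dst, comm_range):
    """Share of snapshots in which a route from src to dst is available."""
    if not snapshots:
        return 0.0
    reachable = sum(route_exists(s, src, dst, comm_range) for s in snapshots)
    return reachable / len(snapshots)
```

No routing protocol can deliver a packet in a snapshot where the network is partitioned, which is why the resulting value bounds the achievable PDR from above.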


4.4.5.1 Scalability Analysis 

Number of Agents in the Network. As seen in Figure 4.29 (a), the number of agents 
has a significant impact on the PDR of routing protocols. First, for a low number of 
agents and a low density in the playground, a non-negligible PDR limitation due 
to mobility can be observed. B.A.T.Mobile, PARRoT, and CA-PARRoT outperform all 
established protocols for higher-density networks. The PDR curve of (CA-)PARRoT 
is close to the mobility constraint and only shows an impact of routing overhead, 
which increases for larger scales. Thus, the results show good scalability, indicating 
that (CA-)PARRoT is suitable for high-density networks. 


Impact of Speed on the End-to-End Performance. MANETs are characterized by 
their mobility, which can mean high agent speeds in many cases. Figure 4.29 (b) shows 
the performance for a range from slow-moving agents up to highly mobile scenarios. High 
speeds require routing protocols to adapt to the network topology very quickly. For 
speeds over 100 km/h, most routing protocols fail to provide reliable routes, which 
causes the PDR to drop sharply. Only (CA-)PARRoT, although also affected 
by the higher requirements of increasing speeds, is capable of providing high PDRs by 
anticipating the network topology and compensating for link losses. 


4.4.5.2 Scenario Studies 

UAV Communication in a Rural 3D-Playground. As seen in Figure 4.30, the novel 
routing protocols, B.A.T.Mobile and (CA-)PARRoT, outperform the established reference 
protocols. The study considers a rural three-dimensional playground with 
air-to-air communication between two UAVs. As the communication range does not cover the 
whole playground, a mobility-constrained optimum is calculated, as a routing path is 
not necessarily available at every point in time. 

Fig. 4.29: Scalability analysis of routing protocols. PARRoT and CA-PARRoT outperform the 
considered reference protocols significantly and, thus, provide a robust communication behavior 
even for challenging conditions. 

All established protocols fail to provide a reliable PDR above the 70 % mark, and 
CA-PARRoT outperforms them by 48 % on average. CA-PARRoT achieves a 3 % 
and 19 % higher PDR than PARRoT and B.A.T.Mobile, respectively, and is thus the 
best-performing mobility-predictive protocol, with only a 5 % gap to the theoretical upper 
bound. The proposed protocols, therefore, show gradual improvements with every 
development stage. In terms of latency, CA-PARRoT also performs best and shows 
a 21 % lower latency than OLSR, which is the established protocol with the lowest 
latency in this analysis. 


Challenging Conditions in Urban Areas. Figure 4.31 shows the end-to-end performance 
for urban radio conditions. B.A.T.Mobile and PARRoT outperform the reference 
protocols for both KPIs under consideration. PARRoT’s performance is slightly weaker 
than that of B.A.T.Mobile due to the immediate impact of incoming routing packets: 
each packet is used for updates, and short-term effects compromise the learning 
accuracy. The proposed CA-PARRoT is able to avoid this behavior and achieves the 
highest reliability of all considered routing protocols. In terms of latency, CA-PARRoT 
also shows up to 51 % lower values compared with established protocols and a 9 % lower 
latency compared with B.A.T.Mobile. In general, higher latencies can be 
observed. These are caused by spontaneous link losses that enforce more buffering in 
the MAC layer. 

Fig. 4.30: The mobility-predictive routing protocols outperform the established references in terms 
of reliability, expressed by the PDR. The proactive integration of trajectory knowledge establishes 
a robust route finding and advances the performance towards the mobility-constrained optimum. 
©[2021] IEEE. Reprinted, with permission, from [586]. 


Performance of MANET Routing in Application-Driven Scenarios. The previously 
presented analysis considers different scalability and radio propagation influences but 
uses a generic random waypoint mobility model. As the mobility in real-world applications 
of MANETs is typically determined by the corresponding task, two examples are 
studied in the following. 


Aerial Cluster Hovering 

UAVs are used to hover over clusters of cars. In this analysis, ten UAVs are deployed to 
cover a total of 50 cars, of which ten are equipped with communication interfaces. The 
remaining 40, therefore, impact only the cluster selection and mobility of other cars, 
but not the communication system. Figure 4.32 (a) shows the PDR of this scenario. The 
incremental position updates reduce the accuracy of the mobility prediction and cause 
B.A.T.Mobile to suffer high performance losses. Although PARRoT also uses mobility 
prediction, it is still able to outperform all other protocols due to the cohesion-aware 
metric considered in the learning algorithm. 


Distributed Dispersion Detection 

To explore plumes, random mobility patterns are not effective. The distributed dispersion 
detection (DDD) is a mobility model that maintains the cohesion of the 
UAV swarm during exploration and, therefore, provides a high number of available 
routes. This is also reflected in Figure 4.32 (b), as all routing protocols show a similar, 
high performance. Nevertheless, B.A.T.Mobile and PARRoT are able to reduce negative 
outliers and provide more reliable communication. 

Fig. 4.31: End-to-end performance in a three-dimensional playground with urban radio conditions. 
©[2021] IEEE. Reprinted, with permission, from [586]. 


4.4.6 Conclusion 


The results show that the proactive integration of mobility-domain knowledge enables 
significantly more robust behavior, which can outperform established routing 
approaches in a wide variety of challenging conditions. The utilization of machine 
learning and reinforcement learning provides additional gains in performance and 
robustness, as the comparison between B.A.T.Mobile and (CA-)PARRoT has shown. The more 
the routing protocols are aware of their environment, e.g., its mobility and radio 
conditions, the higher the achievable robustness becomes. Therefore, good KPI values 
could be observed, even in large-scale scenarios. Intelligent routing algorithms are a key 
component for the realization of efficient infrastructure-less device-to-device 
communication, not only for WiFi-based networks but also for other technologies such as 
cellular approaches. The lack of a centralized unit promises lower latency and enables the 
network participants to learn and adapt to their situation based on local observations, 
reducing the need for communication-expensive status updates. 

Fig. 4.32: Achieved packet delivery ratios for application-driven scenarios. 


Abstract: Vehicular traffic is a complex system with multiple challenges. For example, 
highways and urban traffic networks, different vehicle types with varying maximum 
velocities (cars, trucks, public transportation, etc.), and varying driving behaviors each 
impact traffic flow uniquely and strongly. This means that in order to minimize the 
number of congestion events and the average travel times, it is necessary to analyze, 
model, and simulate traffic in multiple different scenarios. 


In the following, we will introduce cellular automaton models for different scenarios. 
These cellular automaton models aim to reproduce macroscopic traffic phenomena 
through microscopic simulations. With the help of these simulations, we are able to 
analyze, understand, and predict traffic in the given scenarios. Lastly, based on the 
predictions, we can attempt to simulate the same scenarios with small adjustments in 
order to maximize traffic flow and minimize travel time. 


To this end, we will start by introducing and analyzing highway traffic. Here we 
will focus on real-life weather data and how it impacts traffic flow. Next, 
we will investigate urban traffic, where the limited space and the regular interruption 
of traffic flow by traffic lights and intersections result in new and additional constraints. Lastly, 
communicating and automated vehicles will be introduced into the simulations. The 
different reaction times, behaviors, and the human-robot interaction are expected to 
result in new challenges that have to be investigated and predicted. 


4.5.1 Introduction 


The topic of vehicular traffic is gaining more and more attention amid rapidly growing 
numbers of vehicles on the road and increasing amounts of traffic congestion, leading 
to longer average travel times and more fuel consumption. The road capacity cannot 
be increased indefinitely through the addition of new lanes, which means that other 
methods to increase road capacity or use it more efficiently need to be studied. 

In order to use the road more efficiently, traffic and congestion have to be under- 
stood better. To this end, we will first analyze and model traffic behavior on highways 
in Section 4.5.2 with two main goals. The first is to understand the creation of jams 
and their influence on the traffic flow more deeply. For this, we will use the model by 
Nagel and Schreckenberg [467]. The second is to model the asymmetric lane changing 
required in multiple countries around the world. For this, we will use the one-lane 
model by Lee et al. [379]. Asymmetric lane changing means that agents are supposed 
to drive in the outermost (right) lane unless they overtake a slower vehicle. 

After analyzing some general traffic flow behavior, we take a look at two more 
specific problems: the influence of weather on highway capacity and the missing data 
of traffic detectors for real-time applications. 

While these analyses allow better traffic flow predictions on highways, the situation 
in urban and inner-city networks is more complicated due to a higher degree of interac- 
tions of crossing flows and a regular interruption of the traffic caused by traffic lights. 
Therefore, in Section 4.5.3 we adapt the model by Lee et al. to include driving behavior 
in front of traffic lights and in intersections. Furthermore, traffic flow based on empirical 
data from Düsseldorf’s inner city is analyzed. There, different analytical real-time 
routing methods are applied to minimize traffic jams and the average travel time of 
the agents. These analyses help identify traffic bottlenecks that have a high impact 
on the creation of traffic jams and the increase of the average travel time. However, 
space in urban and inner-city networks is often very restricted, so it is not possible 
to increase the capacity of these bottlenecks by building more lanes. Due to this, in 
a follow-up simulation, we changed one of these bottlenecks in a way that two lanes 
(one leading in and the other leading out of the city) at the chosen intersection were 
dynamically changed to either lead into or out of the city. This way, the road capacity 
could be dynamically adjusted to fit the changing demand of commuters. 

Lastly, in Section 4.5.4, we simulate heterogeneous traffic with automated and 
human-driven vehicles. It is expected that automated vehicles will reduce or even elim- 
inate traffic jams at 100 % penetration. However, it will take multiple decades until 
100 % is achieved, and the impact of automated vehicles in heterogeneous traffic is 
unclear due to the different behaviors of automated and human-driven vehicles. Due to 
this, we adapt our model to simulate the different behaviors of automated, communi- 
cating automated, and human-driven vehicles. The goal is to simulate heterogeneous 
traffic where the three different vehicle types mix and then predict how this will impact 
traffic flow and road capacities. 


4.5.2 Highway Traffic Data Aggregation 


The analysis of highway traffic flow can be divided into two topics. General traffic 
behavior and the real-time traffic situation will be analyzed in this section. For that, 
we will first model and simulate how traffic jams are created to understand better how 
and when free-flowing traffic transitions out of free flow due to traffic jams. After that, 
empirical data is used to create realistic lane-changing behavior while considering 
asymmetric lane changing rules, as they are applied in countries like Germany or 
France. 
After that, the impact of real-time weather data on jam creation and travel velocities 
will be analyzed, and the results will be used to improve traffic-flow predictions. The 
prediction of real-time traffic requires real-time data, which sometimes can be missing 
due to failing detectors or communication channels. Therefore, the last part of this 
section will deal with the problem of how to replace missing data accurately. 


4.5.2.1 Traffic Jam Analyses 

The goal of traffic research is often to prevent traffic jams, or reduce their lifetimes. To 
this end, Bette et al. [59] analyzed traffic jams using the Nagel-Schreckenberg model 
[467]. First, the traffic density at which free-flowing traffic transitions into jammed 
traffic was determined based on a stability criterion. Afterward, the ratio of jammed cars 
was decomposed into three mechanisms: the jamming rate, the jam lifetime, and the jam 
size. It was shown that small jams already occur at very low densities and that the 
increasing lifetime of these jams at higher densities is what leads to the transition of 
the traffic flow from free-flowing to jammed traffic. Furthermore, exponents that control 
the scaling of all three jam mechanisms close to the critical density have been derived 
from random walk arguments. 
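
The update rules of the Nagel-Schreckenberg model can be sketched in a few lines. The following is a minimal illustration on a ring road, not the code used in [59]; the function name and the default values for `v_max` and `p_dawdle` are assumptions chosen for the example.

```python
import random


def nasch_step(positions, velocities, road_length, v_max=5, p_dawdle=0.25, rng=random):
    """One parallel update of the Nagel-Schreckenberg model on a ring road.

    positions:  cell index of each car, listed in driving order, so that
                car i + 1 is the leader of car i (the last car follows the first)
    velocities: velocity of each car in cells per time step
    """
    n = len(positions)
    new_velocities = []
    for i in range(n):
        # Number of empty cells between car i and its leader (periodic boundary).
        leader = positions[(i + 1) % n]
        gap = (leader - positions[i] - 1) % road_length
        v = min(velocities[i] + 1, v_max)  # 1. accelerate
        v = min(v, gap)                    # 2. brake to avoid a collision
        if v > 0 and rng.random() < p_dawdle:
            v -= 1                         # 3. random dawdling
        new_velocities.append(v)
    # 4. move all cars simultaneously
    new_positions = [(x + v) % road_length for x, v in zip(positions, new_velocities)]
    return new_positions, new_velocities
```

Repeating this step for different numbers of cars on a fixed ring and counting the standing cars gives a first impression of how jams appear as the density grows.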


4.5.2.2 Asymmetric Lane-Changing Rules 

Lane changing in many countries is asymmetric because drivers are required by law to 
drive on the outermost (right) lane as long as they do not overtake a slower vehicle. This 
asymmetric driving behavior creates multiple differences compared with highway traffic 
without overtaking restrictions. For that reason, empirical data from two countries with 
such asymmetric rules (Germany and France) have been considered in [248]. A multi- 
lane cellular automaton model with asymmetric lane changing rules has been created 
and calibrated based on this empirical data. 

This model is based on the one-lane model by Lee et al. [379], where agents have 
different driving behaviors and a maximal deceleration capability. These two points 
together allow the model to reproduce accidents due to misbehavior. Because of this, 
the lane-changing rules have to fulfill three functionalities. Firstly, they have to be safe, 
which means that the distance to both the preceding and to the following agents has 
to be large enough, depending on the current velocity. Secondly, the agents should 
change to an outer lane (one on their right) as soon as the lane change is safe and they 
do not have to decelerate, while they change to an inner lane (one on their left) only if 
it is safe and they can accelerate or prevent a forced deceleration. Lastly, the agents 
have to be prevented from overtaking a slower driving vehicle on a more outer lane. 
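
The three requirements can be sketched as a small decision function. This is only an illustration of the logic described above, not the calibrated rule set of [248]: the class `LaneView`, the function names, and the concrete thresholds (one time step of travel as safety distance) are assumptions, and the third rule is read here as a ban on passing a slower vehicle that drives on the lane to the left.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LaneView:
    """What an agent sees on a neighbouring lane (distances in cells)."""
    gap_ahead: int   # free cells to the next vehicle ahead on that lane
    gap_behind: int  # free cells to the next vehicle behind on that lane
    v_ahead: int     # velocity of that vehicle ahead
    v_behind: int    # velocity of that vehicle behind


def is_safe(v: int, lane: LaneView) -> bool:
    # Assumed safety criterion: each gap covers one time step of travel.
    return lane.gap_ahead >= v and lane.gap_behind >= lane.v_behind


def choose_lane(v: int, gap_own: int,
                right: Optional[LaneView], left: Optional[LaneView]) -> str:
    # Move to the outer (right) lane if it is safe and no braking is needed there.
    if right is not None and is_safe(v, right) and right.gap_ahead >= v + 1:
        return "right"
    # Move to the inner (left) lane only if staying would force braking
    # and the left lane offers more room.
    blocked = gap_own < v + 1
    if left is not None and blocked and is_safe(v, left) and left.gap_ahead > gap_own:
        return "left"
    return "stay"


def limit_speed(v: int, left: Optional[LaneView]) -> int:
    # Never end the time step ahead of a slower vehicle on the lane to the left.
    if left is None:
        return v
    return min(v, left.gap_ahead + left.v_ahead)
```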

After adding rules to ensure all three points, the model is able to reproduce empiri- 
cal lane usage for two- and three-lane highways. Furthermore, a higher number of lanes 
can be simulated easily after one parameter of the model is re-calibrated accordingly. 

Tab. 4.5: Maximum vehicular traffic flows obtained from the whole empirical dataset. Here, w represents 
the mean of all available water film sensors at a certain point of time. 

Surface   Range of water film thickness w   Maximum flow 
Dry       w < 0.15 mm                       1960 vehs/(h, lane) 
Damp      0.15 mm < w < 0.9 mm              1720 vehs/(h, lane) 
Wet       w > 0.9 mm                        1320 vehs/(h, lane) 


4.5.2.3 Influence of Weather Data on Traffic Predictions 

Vehicle-2-X communication has grown rapidly within the past two decades, which has 
increased the availability of extended Floating Car Data (xFCD) that can be applied 
in the field of traffic information and improvement. One possible integration of this 
additional real-time data is the inclusion of weather data in traffic-flow predictions 
[246]. In order to identify the current weather on the road, the water film thickness is 
measured by local weather stations. Vehicles equipped with xFCD are able to gather and 
communicate this data through the use of rain-sensing windscreen wipers, which react 
to water spray. While this vastly increases the available data, floating car data can be 
less reliable due to the limited radio spectrum available to transmit this data. An 
efficient way of communication is developed in Section 5.2 to transmit data reliably and 
efficiently. An analysis of the correlation between the average velocities of passenger 
cars and water film thickness on the road showed a strong negative correlation of up 
to -0.4 for rush-hour traffic. This means that the incorporation of weather data into 
traffic information systems is expected to be exceptionally beneficial for commuters. 
Furthermore, Table 4.5 shows that an increased water film thickness also decreases 
the minimal road capacity above which the traffic flow becomes unstable and can 
transition away from free-flow, which indicates that the inclusion of weather data 
improves traffic-jam predictions. 

After the impact of water film thickness on the traffic flow was analyzed, the new 
insights were added to the previously discussed asymmetric multi-lane version of 
the Lee et al. model. For that, an additional dallying parameter p(w), which depends 
on the water film thickness on the road w, was introduced. A higher value of this 
parameter increases the probability of an agent decelerating even if the leading vehicle 
is far enough away for its velocity to be safe. The results in Figure 4.33 show a good 
agreement with the empirical data for the roughly 6 km-long Autobahn section chosen 
for the study. This also shows that an accurate traffic prediction needs reliable real-time 
data in order to work. However, empirical data is often not reliable enough to ensure 
constant real-time updates, and data can be missing. This missing data then has to be 
replaced by approximations in real time. 

Fig. 4.33: Simulated breakdown probability for two different degrees of surface wetness. 
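
The surface classes of Table 4.5 and a wetness-dependent dallying parameter can be sketched as follows. The thresholds and maximum flows are taken from Table 4.5; the treatment of the exact boundary values, the linear shape of `dallying_probability`, and its parameters `p_dry`, `p_wet`, and `w_sat_mm` are assumptions made for illustration, since the functional form of p(w) is not given here.

```python
# Maximum flow per surface class in vehicles per hour and lane (Table 4.5).
MAX_FLOW = {"dry": 1960, "damp": 1720, "wet": 1320}


def surface_class(w_mm: float) -> str:
    """Map the mean water film thickness w (in mm) to a surface class."""
    if w_mm < 0.15:
        return "dry"
    if w_mm <= 0.9:  # boundary handling is an assumption
        return "damp"
    return "wet"


def dallying_probability(w_mm: float, p_dry: float = 0.1,
                         p_wet: float = 0.3, w_sat_mm: float = 0.9) -> float:
    """Hypothetical p(w): grows linearly with w and saturates at w_sat_mm."""
    fraction = min(max(w_mm, 0.0) / w_sat_mm, 1.0)
    return p_dry + (p_wet - p_dry) * fraction
```

According to the table values, the maximum flow on a wet surface is roughly one third lower than on a dry one (1320 compared with 1960 vehicles per hour and lane).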


4.5.2.4 Replacing Missing Data 

Previously, in the exponential smoothing prediction method [136], missing data of a 
traffic detector was replaced by analyzing the historical data of this detector. This has 
two major downsides. Firstly, the historical data ideally has to be from the same day 
and time in the past 30 weeks of all D detectors. If, for example, a detector stops 
communicating its findings on a Monday at 12 o’clock, then the data of 12 o’clock of 
the past 30 Mondays has to be considered. This means that all data points of all D 
detectors have to be saved for at least 30 weeks in order to make accurate predictions 
on missing data points. The second problem is that the traffic situation ten weeks 
before, for example, does not have to be the same as today. Accidents or road works, for 
example, could shift the traffic flow from one street to another, which would strongly 
increase the error of such predictions. Because of this, a new method to replace missing 
data has been introduced in [247]. 
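
The historical approach and its storage cost can be illustrated with a short sketch. It uses plain exponential smoothing over the same time slot of the past weeks; the smoothing factor `alpha` and the one-minute sampling assumed in `stored_values` are illustrative assumptions and are not taken from [136].

```python
def exponential_smoothing(history, alpha=0.3):
    """Estimate a missing reading from past readings of the same time slot.

    history: readings ordered from oldest to newest, for example the flow
             at 12 o'clock on each of the past 30 Mondays
    alpha:   weight of the most recent reading (assumed value)
    """
    if not history:
        raise ValueError("at least one historical reading is required")
    estimate = history[0]
    for value in history[1:]:
        estimate = alpha * value + (1 - alpha) * estimate
    return estimate


def stored_values(detectors, weeks=30, samples_per_day=1440):
    """Number of readings that must be kept for all detectors."""
    return detectors * weeks * 7 * samples_per_day
```

With one reading per minute, a single detector already requires 302,400 stored values for 30 weeks, which illustrates the first downside mentioned above.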

In this new method, 60 minutes of historical data of the surrounding N detectors 
from the preceding week is taken. This data is then used to train a Poisson Dependency 
Network (PDN) [249] (which is a form of a Poisson model explained in Section 4.1.2.6). 
This PDN shows how strongly the traffic-flow data at one detector point correlates with 
the data of the detector that has missing data. Then, in a final step, the real-time data 
of the other d detectors is inserted into the PDN to fill the missing data point. 
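
The idea of filling a gap from neighbouring detectors can be sketched with a single log-linear Poisson regression, which stands in for one local conditional model of a PDN. This is a simplified illustration under stated assumptions and not the implementation of [247] or [249]: only one neighbouring detector is used, the model is fitted with Newton's method, and the neighbour readings are assumed to be scaled to small numbers (for example, hundreds of vehicles per hour) so that the iteration stays numerically stable.

```python
import math


def fit_poisson_regression(x, y, iterations=25):
    """Fit log(lambda) = b0 + b1 * x by maximum likelihood (Newton's method).

    x: readings of the neighbouring detector (training window)
    y: readings of the target detector at the same times
    """
    b0 = math.log(sum(y) / len(y))  # start from the mean count
    b1 = 0.0
    for _ in range(iterations):
        lam = [math.exp(b0 + b1 * xi) for xi in x]
        # Gradient of the Poisson log-likelihood.
        g0 = sum(yi - li for yi, li in zip(y, lam))
        g1 = sum((yi - li) * xi for xi, yi, li in zip(x, y, lam))
        # Entries of the negated Hessian.
        h00 = sum(lam)
        h01 = sum(li * xi for xi, li in zip(x, lam))
        h11 = sum(li * xi * xi for xi, li in zip(x, lam))
        det = h00 * h11 - h01 * h01
        if det <= 0.0:
            raise ValueError("neighbour readings must not all be identical")
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


def fill_missing(b0, b1, neighbour_now):
    """Expected reading of the silent detector given the neighbour's current reading."""
    return math.exp(b0 + b1 * neighbour_now)
```

In this sketch, the training window would hold the 60 minutes of data from the preceding week, and `fill_missing` is then evaluated with the neighbour's real-time reading.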

In order to test the method, empirical data was taken, and a prediction of it was 
made as if it were missing. Then the empirical data was compared with its prediction in 
Figure 4.34. One can see that the PDN is closer to the actual data than the exponential 
smoothing prediction that just used the past N weeks of traffic data of the “missing” 
detector. A general test at different times of the day and different days of the week 
showed that the PDN not only uses less historical data but also predicts missing traffic 
data more accurately. A more in-depth analysis of the problem of traffic flow prediction 
is given in Section 4.1. 

Fig. 4.34: Example empirical data together with exponential smoothing and PDN predictions based 
on the data. 


4.5.3 Urban Traffic Simulations 


Inner-city traffic is more complicated than highway traffic due to more interactions of 
vehicles with different travel directions and intersections with or without traffic lights. 
In order to analyze and simulate inner-city traffic, the Lee et al. model, which was used 
as a basis for highway traffic simulations in the previous section, was modified with 
additional rules in [710] to reproduce empirical intersection traffic data. In another 
work, different methods to dynamically optimize inner-city traffic through different 
routing methods and a dynamic allocation of lanes were analyzed [709]. There, we 
were able to show that, although traffic demand often exceeds the network capacity, one can 
decrease the number of traffic jams and the average travel time through more dynamic 
routing methods and a more dynamic use of the infrastructure. 


4.5.3.1 Cellular Automaton Model 

The simulation of urban traffic with a model based on that by Lee et al. [379] has to 
include a couple of complex situations. Different intersections, for example, can be very 
different in their structure and serve different purposes. Some intersections can have 
lanes on which one is only allowed to turn left, while another intersection with the same 
number of lanes and roads allows turning left or going straight from the 
innermost (left) lane. Furthermore, agents have to ensure that they reach the correct lane 
for their route before they arrive at the traffic lights. Finally, vehicles that 
turn left within an intersection sometimes have to take into account that the crossing 
traffic has a green light at the same time, and they are only allowed to turn if they do 
not disturb this ongoing traffic flow. 

After these points were included within a cellular automaton model in [710], the 
model was calibrated so that the time requirement tg ≈ 1.6 ± 0.6 s is in accordance with 
empirical data. To calculate the time requirement tg, the number of cars that arrive at a 
red traffic light in front of an intersection within the simulation is set higher than the 
number of vehicles that can pass this traffic light once it turns green. When the traffic 
light turns green, the time each agent requires to move into the intersection after the 
preceding agent moved into it is taken. This time is the time requirement tg and, after 
3-4 s of a green traffic light, it averages out to around tg = 2.23 s with an uncertainty of 0.04 s. This is in 
accordance with the empirical data. 
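
The time requirement translates directly into a lane capacity during green, since one vehicle enters the intersection per time requirement. The following sketch shows this arithmetic; the function names and the optional `startup_s` parameter are assumptions introduced for the example.

```python
def saturation_flow(time_requirement_s: float) -> float:
    """Vehicles per hour of green time and lane for a given time requirement."""
    return 3600.0 / time_requirement_s


def vehicles_per_green(green_s: float, time_requirement_s: float,
                       startup_s: float = 0.0) -> int:
    """Vehicles that can enter the intersection during one green phase.

    startup_s is an assumed start-up loss at the beginning of green.
    """
    usable = max(green_s - startup_s, 0.0)
    return int(usable // time_requirement_s)
```

For the average value of 2.23 s quoted above, this gives about 1614 vehicles per hour of green time and lane, or 13 vehicles during a 30 s green phase.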

Afterward, an intersection was modeled in which the traffic lights that control left-turning 
traffic have green at the same time as the crossing traffic flow. Because turning 
agents decelerate before they turn, the time requirement is already higher than for 
traffic that does not turn, which means that the lane capacity is lower. Furthermore, 
this time requirement increases strongly with the crossing traffic flow. 
This means that in order to minimize traffic jams and travel times in urban traffic, one 
not only has to consider the traffic flow from one point to another but also all other 
traffic flows within the network. 


4.5.3.2 Inner-City Traffic Optimization 

In [709], inner-city traffic was analyzed and simulated with the help of empirical data 
from Düsseldorf’s inner city. Based on the analyses of the traffic flow, different routing 
methods were used to guide the traffic through the application of real-time data. The 
analyses showed that the traffic capacity of roads inside the city is not enough to cover 
demand at all times. However, a routing method that aims to make maximal use of the 
road capacity rather than route vehicles depending on their travel time would improve 
the traffic flow significantly and could reduce the average travel time by up to 23 %. The 
downside of optimizing the traffic depending on the network capacity rather than the 
travel times is that if the traffic flow is below the road capacity, vehicles will take roads 
with longer travel times than necessary, which increases fuel consumption and travel 
times unnecessarily. Due to these findings, a new routing method was developed that 
considers both the road capacity and the travel time of each agent individually. As one 
can see in Figure 4.35, the new routing method (green) yields shorter travel times 
than the network optimization method (red) at low traffic volumes while also reducing 
the travel times relative to routing methods where each agent uses the route with the shortest 
travel time (blue and purple) at high traffic volumes. 

After it was shown that a more efficient routing method would decrease the av- 
erage travel time and make more use of the given road capacity, we tested how the 
road capacity itself could be dynamically optimized. To this end, one of the busiest 
intersections leading into and out of the inner city was identified. This intersection 

Fig. 4.35: Travel times for different traffic assignment methods. The lower lines represent the average 
travel time taken over 10 minutes while the upper line represents the 95 % confidence interval 
upper bound. 


connects the city ring with one of the main roads leading through the city. One of the 
lanes on the city ring, as well as one of those on the road leading through the city, were 
changed to dynamic lanes. Traffic into and out of Düsseldorf is very unbalanced at 
different times of the day due to the high number of commuters. The dynamic lanes 
were able to reduce the commute time into the city in the morning while also decreasing 
the travel time needed to leave the city in the afternoon by increasing the road capacity 
where it was needed and decreasing it where it was not needed. Through this dynamic 
change done to a single intersection, the average travel time could be reduced by over 
10 % without changing the way vehicles currently choose their routes. Note that these 
travel-time reductions are not necessarily the optimal reductions. The goal is rather to 
understand the network and find its bottlenecks. How to find the optimal routes for 
each vehicle to reduce the global average travel time is analyzed in Section 4.1.4. There, 
a new method that applies a reinforcement learning algorithm is simulated, and the 
results are compared against others. 


4.5.4 Automated Vehicular Traffic Flow 


Even though dynamic routing and dynamic shifting of the road capacity can reduce the 
average travel time, this is not a permanent fix for the increasing number of traffic jams 
since the number of vehicles is expected to continue increasing in the near future. By 
contrast, automated vehicles in fully (100 %) automated traffic are expected to increase road 
capacity and significantly reduce travel times, fuel consumption, and accidents. However, it 
is not clear yet how automated vehicles will be introduced into the traffic. Their impact 
on the traffic dynamics in heterogeneous traffic, when they mix with human-driven 
vehicles, has been widely discussed. Traffic will be heterogeneous for multiple decades 
due to an average lifespan of a vehicle of around ten years [42]. 

In order to predict heterogeneous traffic behavior, a new cellular automaton model 
was introduced in [708]. One of the big challenges of modeling automated vehicle 
traffic with cellular automata is that Automated Vehicles (AVs) and Communicating Automated Vehicles 
(CAVs) have a reduced reaction time compared with human-driven vehicles (HVs). 
For CAVs, this reaction time could go as low as the time it takes to communicate, 
which is currently around 0.1s but could go even lower with 5G, which is expected to 
become an important method for future connected and automated vehicles. Currently, 
communication does not always take a fixed time length but instead varies depending 
on the limited available radio spectrum. If the full spectrum is used, communication 
can take a lot longer than 0.1s, depending on the means of communication. An in-depth 
analysis of the problems of communication and how it is realized is given in Section 5.2. 
For the remainder of this section, we will assume a stable communication with a 0.1s 
communication time. Therefore, the first step toward simulating AVs or CAVs with 
cellular automaton models was to reduce the time-step length of the model to 0.1s per 
time step. 


4.5.4.1 Reduced Time-Step Length 

The cellular automaton model introduced in [708] is based upon the Pottmeier et al. 
[508] accident-free version of the Lee model [379]. Human-driven vehicle agents in 
this model can judge their situation optimistically or pessimistically. If they judge it 
optimistically, they do not expect their leading vehicle to decelerate strongly (only 
dawdle). In this case, they can follow the leading vehicle with less than the minimum 
safety distance, something that is often found in empirical data. If they judge their 
situation pessimistically, they expect the leading vehicle to decelerate at any moment. 
In that case, they do not just keep the minimum safety distance but apply an additional 
safety distance that depends on their velocity. 

The newly introduced model reproduces this behavior while also having a 0.1s 
long time step and keeping an average reaction time of 1s for the HVs. For that, multiple 
changes were made to the calculation of the safe velocity, dawdling, and the judgment 
of the situation. However, Figure 4.36 shows that the resulting 0.1s long time-step 
model presented in [708] is still able to reproduce realistic human-driven vehicle traffic 
comparable with the model presented by Lee et al. [379] and the modified version of that 
by Pottmeier et al., which are both known to reproduce empirical traffic well [508]. As 
one can see, the main difference between the three models is at a density of 20-45 Veh 

Fig. 4.36: Fundamental diagram for the three different human-driven vehicle cellular automaton 
models. 


in the so-called synchronized traffic phase [317]. Pottmeier et al.’s modification of the 
Lee model only differs from the Lee model in the calculation of the judgment of the 
situation. Agents in this model are less likely to be optimistic, which prevents accidents 
but also increases the average vehicle following time and so reduces the traffic flow. 
The model with the reduced time-step length also uses this curbed optimism, due to which the traffic 
flow is initially below that of the Lee model. However, due to changes to the dawdling, 
the velocity distribution is more uniform in the 0.1s time-step length model, which 
strengthens the synchronization and increases the traffic flow towards the end of this 
traffic phase. 

Overall, the differences between the three models are smaller than the fluctuations 
observed in empirical traffic, and they all reproduce realistic traffic flow. 


4.5.4.2 Heterogeneous Automated Vehicle Traffic 

AV and CAV agents have multiple differences compared with HV agents. Their three 
most important differences are reaction time, dawdle, and behavior calculation. An AV 
has a reaction time of 0.5 s and a CAV one of at least 0.1 s, compared with the 1 s of HVs. 
Neither CAVs nor AVs dawdle at all, while HVs dawdle with a probability of up to 37 % 
[379]. Lastly, CAVs and AVs do not judge their situation optimistically or pessimistically, 
but instead, they always follow the leading vehicle with at least the minimum safety 
distance. If the leading vehicle is human-driven, then they follow with more than the 
minimum safety distance to be able to react to unpredictable human behavior without 
creating strong deceleration waves. 
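
The effect of the reaction time on the required following distance can be made concrete with a textbook kinematic criterion. The sketch below is not the safe-velocity rule of the model in [708]; the formula, the function names, and the deceleration `max_decel` are assumptions, while the reaction times are the ones stated above.

```python
# Reaction times in seconds as stated in the text.
REACTION_TIME_S = {"HV": 1.0, "AV": 0.5, "CAV": 0.1}


def min_safety_distance(v_follower: float, v_leader: float,
                        vehicle_type: str, max_decel: float = 6.0) -> float:
    """Gap in metres that lets the follower stop behind a fully braking leader.

    Assumes both vehicles brake with the same deceleration max_decel (m/s^2);
    velocities are given in m/s.
    """
    tau = REACTION_TIME_S[vehicle_type]
    distance = v_follower * tau + (v_follower ** 2 - v_leader ** 2) / (2.0 * max_decel)
    return max(distance, 0.0)


def reaction_steps(vehicle_type: str, time_step_s: float = 0.1) -> int:
    """Reaction time expressed in simulation time steps."""
    return round(REACTION_TIME_S[vehicle_type] / time_step_s)
```

At equal velocities of 30 m/s, this criterion gives 30 m for an HV, 15 m for an AV, and 3 m for a CAV, and a 1 s reaction time corresponds to ten time steps of 0.1 s.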

These differences between AVs, CAVs, and HVs were defined and included in the 
model in [708] before heterogeneous traffic flow was simulated. The simulation results 
are shown in Figure 4.37, together with the theoretically predicted capacity increase 
[215] for heterogeneous traffic. One can see that automated vehicular traffic is expected 
to increase the traffic capacity compared with homogeneous human-driven traffic. The effect 
is even stronger for communicating automated vehicles. A reduced reaction time means 
that the minimum safety distance (and so the average vehicle following time) is lower, 
which allows higher traffic flow at similar densities. Furthermore, the reduced reaction 
time also means that these automated agents overreact less to human dawdling, which 
reduces deceleration waves. This effect is strengthened because they do not apply 
optimistic or pessimistic behavior states. If an HV agent changes its behavior from 
optimistic to pessimistic, it needs a higher distance to its leading vehicle even if the 
velocities would not change. This means that the agent has to decelerate more than 
AVs or CAVs that only have to decelerate as much as the leading vehicle to keep up the 
same distance. 

Fig. 4.37: Road capacity increase over homogeneous human-driven vehicular traffic depending on the 
percentage of automated or communicating automated vehicles. 


However, while these results show an improved road capacity in every heterogeneous 
traffic situation compared with purely human-driven traffic, the model was also able 
to reproduce an increased rear-end-collision risk due to human-robot interactions. 
Human drivers tend to drive with less than the minimum safety distance if they judge 
the traffic situation optimistically and do not expect to have to decelerate within the next 
couple of seconds. Before, when they followed another HV agent or even an AV agent, 
the remaining safety distance the agent used was just enough to prevent accidents if 
the traffic situation suddenly changed. Now, however, because the CAVs are able to 
react with only one-tenth of the time the average human needs, the velocity difference 
between a decelerating CAV and the following HV is so much larger than before that 
the following HV is not able to prevent an accident once the agent has reacted. 

This shows that the different behaviors of HVs and CAVs have the potential to 
increase the rear-end accident risk. However, the results shown in Figure 4.37 were simulated 
after this accident risk was prevented through a change in the behavior of CAVs. If an 
HV agent n follows a CAV agent n + 1 with less than the minimum safety distance, 
then the CAV agent increases its distance to its leading agent n + 2. This way, if agent 
n + 2 decelerates, then agent n + 1 can decelerate after driving this additional distance, 
which gives agent n the time to react to the brake lights of agent n + 2, thus preventing 
accidents. 
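
The adapted CAV behavior can be sketched as a rule for the gap that agent n + 1 keeps to agent n + 2. How much extra distance the model in [708] adds is not stated here, so the choice below (the distance the tailgating HV is missing) and all names are assumptions for illustration.

```python
def cav_target_gap(min_gap_cav: float, gap_follower: float,
                   min_gap_follower: float, follower_is_hv: bool) -> float:
    """Gap the CAV (agent n + 1) keeps to its leader (agent n + 2).

    min_gap_cav:      minimum safety distance of the CAV itself
    gap_follower:     current gap between the follower (agent n) and the CAV
    min_gap_follower: minimum safety distance the follower should keep
    """
    if follower_is_hv and gap_follower < min_gap_follower:
        # Assumed buffer: add the distance the tailgating HV is missing.
        return min_gap_cav + (min_gap_follower - gap_follower)
    return min_gap_cav
```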


4.5.4.3 Conclusion 

In this section, we highlighted different analyses of vehicle traffic scenarios on highways, 
in urban areas, and among heterogeneous traffic where automated and conventional 
vehicles are mixed. The main goal of those analyses is to understand and predict traffic 
better. Furthermore, the already existing cellular automaton vehicle model introduced 
by Lee et al. [379] was modified for several of these scenarios. 

We were able to show that traffic jams already form at very low densities and that 
the increase in their duration leads to a transition from free-flowing to jammed traffic. 
While there is a critical density above which free flow can transition to jammed traffic flow, 
it was also shown in other works that this critical density is not fixed and can vary. 
On highways, the local weather in the form of rain (measured by water film thickness) 
was identified as an important influence on traffic flow, while urban traffic is mostly 
dominated by traffic lights. Both problems were analyzed through the use of real-life 
data as well as simulations of a modified Lee model. 

Through the results of those analyses, the accuracy of traffic forecasting could be 
improved as long as enough real-time data was given. Unfortunately, the reliability of 
empirical detectors is often not fully guaranteed, and detectors can malfunction or not 
communicate data. For such cases, we have introduced a new method to fill missing 
data gaps that uses less historical data and is more accurate. Lastly, we modified the 
Lee model to predict the impact automated and communicating automated vehicles 
will have in future traffic when mixing with conventional vehicles. 


Abstract: Due to its pervasiveness and convenience, crowdsensing is regarded as an 
effective method to collect specific data. This section surveys projects that take advantage 
of embedded crowdsensing to collect pavement condition data and describes how 
crowdsensing platforms conduct road damage detection using deep neural networks 
with images captured with smartphones. Before such discussion, we explore how to 
motivate users to participate in low platform-cost crowdsensing tasks. Our research 
models the pavement crowdsensing problem and designs new incentive mechanisms 
based on a platform-driven greedy algorithm. Through extensive simulations, the per- 
formance of the incentive mechanisms is evaluated and compared in different scenarios 
in terms of the platform cost and the overall task completion time. The best of them can 
reduce the total completion time by half compared with the reverse auction incentive 
mechanism. We conclude this contribution with future work discussions. 


4.6.1 Introduction 


As road infrastructure increases in size and complexity, innovative solutions must be 
developed to cope with road degradation. With the current methods used to collect 
road condition information, covering 4.18 million miles of road in the U.S. [555] is a 
difficult undertaking. One viable solution is to create a crowdsensing platform where 
smartphone users collect road pavement data. In this case, users could automatically 
detect and report poor road conditions using embedded cameras, accelerometers, and 
4G/5G networks. However, such a network would require an active user base and a 
means of maintaining crowdsensing participation. Thus, crowdsensing schemes must 
design an incentive structure for continuous user activity. A crowdsensing platform 
utilizes an incentive mechanism to motivate user participation, produce diverse data 
pools, and generate quality information. 

Later in this section, we will formally express the incentive mechanisms as one 
of the key components of crowdsensing platforms. The mechanisms designed in this 
section are evaluated through experimental procedure, and vary based on the reward 
distribution assigned to sensing tasks. Favorable mechanisms are those that exhibit 
both low platform cost and total operation time when completing sensing tasks. Note that 
these criteria are not the only metrics for measuring incentive mechanism efficacy. 
However, platform cost and total operation time metrics are sufficient for providing a 
solution framework to the proposed crowdsensing problem. 

Our crowdsensing platform is parameterized with nine unique incentive mecha- 
nisms, and aims to collect pavement condition data for varying percentages of a fixed 
area with economic feasibility in mind. All mechanisms are based on a platform-driven 
greedy algorithm that motivates users to select sensing tasks that can provide the highest 
net profit margin for the participant. Eight of the nine incentive mechanisms are 
newly defined in this section, while the ninth is motivated 
by recent related work in the crowdsensing literature. Our results provide guidance in 
selecting the best incentive mechanism in different settings of pavement crowdsensing. 

Here are the key contributions of the section: 

— The incentive mechanisms we design can effectively avoid the cost explosion prob- 
lem as users choose their sensing tasks before starting to work on them. Thus, 
sensing tasks can only be committed to by one participant at a time. Cases where 
a sensing task should be reexamined (possibly due to poor readings) can be 
addressed by modeling repeated sensing tasks. 

— Our mechanisms enable users to select sensing tasks that offer the highest net 
profit margin based on a greedy algorithm. 

— The total operation time of our approach is reduced compared with that of the 
task-reverse-auction incentive mechanism. The results highlight this claim and 
provide solutions for crowdsensing given a target area within a limited budget. 
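
The greedy selection named in the contributions can be sketched as follows. The sketch assumes that the net profit of a task is its reward minus a travel cost proportional to the straight-line distance; the data layout, the function names, and `cost_per_km` are hypothetical and do not reproduce any of the nine mechanisms exactly.

```python
from math import hypot


def net_profit(user_pos, task, cost_per_km):
    """Reward of a task minus the assumed travel cost to reach it."""
    distance = hypot(task["x"] - user_pos[0], task["y"] - user_pos[1])
    return task["reward"] - cost_per_km * distance


def select_task(user_pos, tasks, committed, cost_per_km=0.5):
    """Greedily pick the uncommitted task with the highest positive net profit.

    tasks:     dict mapping task id to {"x", "y", "reward"}
    committed: set of task ids already taken; it is updated in place so that
               each task is committed to by only one participant
    """
    best_id, best_profit = None, 0.0
    for task_id, task in tasks.items():
        if task_id in committed:
            continue
        profit = net_profit(user_pos, task, cost_per_km)
        if profit > best_profit:
            best_id, best_profit = task_id, profit
    if best_id is not None:
        committed.add(best_id)
    return best_id
```

Because a task is marked as committed before the user starts working on it, no second user collects data for the same task in vain.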


The rest of this discussion is organized as follows: we survey the related work, introduce the 
research problem and its model, present our incentive mechanism solutions, construct 
simulations for evaluating the incentive mechanisms, discuss the evaluation results, 
consider augmented machine learning techniques, and conclude with final statements. 


4.6.2 Incentive Mechanisms 


4.6.2.1 Existing Monetary Incentive Mechanisms 

Zhang [763] and Jaimes [299] both classify incentive mechanisms by the types of incentives. 
In Jaimes [299], monetary and non-monetary incentives are compared. Non-monetary 
mechanisms [158, 166] rely on the continued participation of users due to 
intrinsic motivations. Monetary mechanisms [15, 347, 356, 380, 381, 735, 743, 762, 766] 
rely on the direct backing of fiat money or indirect backing of fiat money through alter- 
native currencies. According to a survey paper [763], monetary incentives will be more 
likely to motivate users to complete the sensing tasks than non-monetary incentives. 
Therefore, a monetary mechanism is more fitting for crowdsensing and will be consid- 
ered in our discussion. However, picking the correct monetary incentive mechanism 
poses additional challenges. 
When deciding on which monetary mechanism to use, one must define the optimization 
criteria that best suit the crowdsensing scheme. Examples of such criteria include 
economic feasibility, area coverage, data quality, fairness, and time duration. Produc- 
ing a platform that optimizes one or more of these categories is non-trivial because 
the typical problem framework of a crowdsensing scheme comes down to exponential 
time complexity problems or typical game theoretical model challenges. In the case of 
economic feasibility, a platform must be designed carefully in order not to allow users 
too much control over the price of their service. This is known as the cost explosion 
problem, and is one of many challenges that must be addressed when considering ap- 
propriate monetary mechanisms. The following three monetary incentive mechanisms 
are well studied solutions to various crowdsensing schemes. 

— The task-reverse-auction incentive mechanisms [738, 762, 766] allow a set of 
users to bid on the set of tasks posted by the platform. Each bid represents a promise 
to finish a task provided that the platform will pay the user the bid value. Naturally, 
the user who bids the lowest price wins and gets the opportunity to perform the 
sensing task. 

— In the case of the data-reverse-auction incentive mechanisms [356, 381], a set of 
users auction their sensing data for the posted set of tasks and their prices per 
already finished tasks. Then, the platform selects the data that satisfies its criteria 
and pays the users their bid price. 

— The platform-centric model [743] treats the crowdsensing problem as a Stackelberg 
game. The reward of the task is changed until the platform and users reach a Nash 
equilibrium. 
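
The winner selection of the task-reverse-auction described in the first item can be sketched as follows. This is only a minimal illustration: the function name and the representation of bids as (user, task, price) triples are assumptions made for the example and are not taken from the cited works, and untruthful bidding is not modeled.

```python
def select_winners(bids):
    """Return {task_id: (user_id, price)} holding the lowest bid per task.

    `bids` is an iterable of (user_id, task_id, price) triples; this layout
    is a hypothetical choice for the example.
    """
    winners = {}
    for user_id, task_id, price in bids:
        best = winners.get(task_id)
        # Keep the first lowest bid seen for each task.
        if best is None or price < best[1]:
            winners[task_id] = (user_id, price)
    return winners


example_bids = [
    ("u1", "s1", 4.0),
    ("u2", "s1", 3.5),
    ("u2", "s2", 6.0),
    ("u3", "s2", 7.5),
]
winners = select_winners(example_bids)
```

Note that the rule looks only at the price, so nothing prevents a distant user from winning a task. This is the weakness examined in the next subsection.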


4.6.2.2 Examination of Three Typical Incentive Mechanisms 
Most of the existing incentive mechanisms fall under game theoretical models. In this 
subsection, we examine the three aforementioned mechanisms. 

The main problem of the task-reverse-auction approach [738, 762, 766] is that,
because of untruthful bids, the auction does not always select the nearest user
to complete a sensing task [766]. In this situation, a user who is far away from a
sensing task can win the auction. Longer distances result in a longer travel time for
users. Thus, the task-reverse-auction incentive mechanisms need more time to complete
all the sensing tasks than our incentive mechanisms.

For data-reverse-auction incentive mechanisms [356, 381], while multiple users
collect the data for one sensing task, only one user's data can be accepted by the
platform. In other words, the other users' data is wasted. As a result, this type of
incentive mechanism increases the costs for car fuel, personal free time, etc. With our
incentive mechanisms, users select the sensing task before they go to collect the data.
Thus, the cost explosion problem can be avoided.

The platform-centric model [743] assumes that the platform has no upper bound
on its budget. Therefore, it can find an optimal solution that gives the platform the
highest-quality data available. In practice, the platform usually has a limited budget
and may not be able to afford the reward at the game-theoretic equilibrium. Our incentive
mechanisms provide heuristics to obtain data while working under tighter budget constraints,
and can be considered more consistent with an online crowdsensing scheme.


4.6.3 Incentive Mechanism Research Problem and Its Model 


For our research problem, the platform needs to motivate its users to
collect the road pavement data subject to a budget and a target area. In this case, our
research objective is to design an appropriate incentive mechanism to help the platform
achieve an area coverage target with a low cost and a low total operation time. Based on
the comparison results of incentive mechanisms, the platform can choose the best
incentive mechanism with the lowest budget for different area coverage targets.

Our model of the research problem contains three entities: the environment, the 
sensing task, and the user. Each entity can be described by its behavior and/or its 
relationship with other entities: 

— The environment entity is based on the Manhattan model; without loss of generality
for incentive mechanism studies, it is a grid of cells. The grid has a uniform cost
distribution for traveling across adjacent cells and no missing cells within. The
environment represents the types of roads that users may encounter and the varying
costs of traveling with different pavement conditions. Lastly, user entities can
move only to one orthogonally adjacent cell per unit of time; users cannot
move diagonally.

— The sensing task entity contains information on the location of interest and the 
monetary incentive associated with user participation. The sensing tasks specify 
roads where pavement sensing is needed. 

— The user entity represents users participating in the crowdsensing scheme. As users 
continue to collect and report data for rewards, they accumulate monetary rewards 
and endure operation costs. 


4.6.4 Incentive Mechanism Solutions 


Modularity and scalability are critical features needed in designing a crowdsensing 
framework for deploying and testing incentive mechanisms. It would be difficult to 
swap incentive mechanisms and evaluate them without these features. Our crowdsens- 
ing platform and incentive mechanism designs are guided by the evaluation metrics 
described in this section. 
4.6.4.1 Notations

The symbols we use in this report are shown in Table 4.6. Two important variables
in our model are $s_j$ and $u_i$. They represent the identification numbers of the sensing
tasks and users. The tasks $s_j$ and the users $u_i$ have the attributes $\langle u_i, R_{ij}, x_j, y_j \rangle$ and
$\langle s_j, a_i, C_{ij}, x_i, y_i \rangle$, respectively. For users, if $s_j$ is 0 or $-1$, then the user is currently
not participating because the user has not yet selected a sensing task or has dropped out, respectively.
For sensing tasks, if $u_i$ is 0, then the sensing task has not been assigned to a user. In
addition, if a sensing task has a reward equal to 0, then its reward has been claimed.
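
To make the two attribute tuples and their sentinel values concrete, the following sketch encodes them as Python data classes. The class and field names are hypothetical and chosen for readability only; the comments give the corresponding symbols. The sketch assumes that identification numbers start at 1, so that 0 can mean "none".

```python
from dataclasses import dataclass

NOT_ASSIGNED = 0   # no task selected yet / no user assigned yet
DROPPED_OUT = -1   # the user has left the trial


@dataclass
class User:
    uid: int                      # u_i
    x: int                        # x_i
    y: int                        # y_i
    task_id: int = NOT_ASSIGNED   # s_j
    accumulated: float = 0.0      # a_i
    travel_cost: float = 0.0      # C_ij

    def is_participating(self) -> bool:
        # A user participates only while a real task ID is stored.
        return self.task_id not in (NOT_ASSIGNED, DROPPED_OUT)


@dataclass
class Task:
    sid: int                      # s_j
    reward: float                 # R_ij
    x: int                        # x_j
    y: int                        # y_j
    user_id: int = NOT_ASSIGNED   # u_i

    def is_claimed(self) -> bool:
        # A reward of 0 marks a task whose reward has been claimed.
        return self.reward == 0
```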


Tab. 4.6: Common symbols.

Symbol          Meaning
$a_i$           Accumulated reward of user $u_i$
$avg_j$         Average distance from task $s_j$ to all users
$B$             Budget for the platform
$BR$            Base reward
$b$             The side length of the grid
$C_{ij}$        The travel cost for $u_i$ to complete $s_j$
$CR$            The reward of the task that offers $MP$
$d_{j,uc}$      Distance from $s_j$ to $uc$
$d_{j,tc}$      Distance from $s_j$ to $tc$
$IM$            Incentive mechanism
$k_i$           The ranking number for $u_i$
$MP$            Maximum profit for user $u_i$
$NPM$           Net profit margin
$P$             Area coverage percentage
$P_{ij}$        Profit for $u_i$ of sensing task $s_j$
$PC$            The platform cost
$R_{ij}$        Reward of the sensing task $s_j$ for user $u_i$
$(S)$ $s_j$     (Set of) Sensing task/ID
$S_a$           The set of available tasks
$SID$           The index of the task selected by user $u_i$
$sr$            The percentage of trials that succeed
$T$             Threshold for net profit margin
$tc$            The center of locations of sensing tasks
$t_f$           Total operation time
$(U)$ $u_i$     (Set of) User/ID
$uc$            The center of locations of users
$x_i$           x-coordinate
$y_i$           y-coordinate
4.6.4.2 Evaluation Metrics
The purpose of the evaluation metrics is to differentiate the incentive mechanisms 
and to guide the design of the crowdsensing solutions. The simulations for incentive 
mechanism evaluations consist of an extensive number of trials. In each trial, we 
initialize the tasks and users at the beginning and the simulation runs until all tasks 
are completed or all users drop out. The details of the evaluation metrics are described 
as follows: 

- The total operation time $t_f$ represents the duration of a trial. In one trial, a timer
starts at time 0 and ends at the time $t_f$ when all sensing tasks are completed or
all users drop out. While two incentive mechanisms may have an equal success rate
$sr$, one incentive mechanism might have a lower total operation time $t_f$. This implies
that its users have been incentivized to select and perform tasks more efficiently.

— The platform cost in Equation 4.12 is the amount of money that the platform pays 
the users through sensing task rewards. The surplus is the portion of the budget 
that is not used by the end of a trial. A lower platform cost reflects the ability of 
incentive mechanisms to reduce the cost of sensing task rewards. 


PC = B - surplus. (4.12) 


4.6.4.3 Platform-Driven Greedy Algorithm 

The platform-driven greedy algorithm that we use to design our incentive mechanisms 
is shown in Algorithm 4. The idea of this algorithm is to select an available task that 
gives the maximum profit to the user. Thus, this platform-driven greedy algorithm 
computes the gain of task $s_j$ to user $u_i$ by Equation 4.13,


$P_{ij} = R_{ij} - C_{ij}$ (4.13)


in which $R_{ij}$ is determined by the incentive mechanism. We describe $R_{ij}$ in more detail
in the following subsection. After this algorithm finds the task $s_j$ that
provides the maximum profit for user $u_i$, the user $u_i$ checks whether the net profit
margin of the task $s_j$ reaches the threshold $T$. If so, the user $u_i$ selects the
task; otherwise, the user $u_i$ drops out.
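
The selection step can be sketched in Python as shown below. This is an illustrative reading rather than the reference implementation: the function name and the dictionary-based inputs are hypothetical, and the net profit margin is assumed to be 100 * (MP + X) / X, with X = CR for a user without accumulated reward and X = a_i otherwise, which is one reading of Algorithm 4.

```python
import math


def greedy_select(accumulated, costs, rewards, threshold):
    """Pick the available task with the maximum profit for one user.

    accumulated: a_i, the reward the user has collected so far
    costs:       {task_id: C_ij} for every available task
    rewards:     {task_id: R_ij} as set by the incentive mechanism
    threshold:   T, the minimum net profit margin in percent
    Returns (task_id, new_accumulated); task_id is -1 if the user drops out.
    """
    if not costs:                              # no task is available
        return -1, accumulated
    max_profit, chosen_reward, chosen = -math.inf, -math.inf, None
    for task_id, cost in costs.items():
        profit = rewards[task_id] - cost       # Equation 4.13
        if profit > max_profit:
            max_profit, chosen_reward, chosen = profit, rewards[task_id], task_id
    # Assumed net profit margin (see the lead-in above).
    base = chosen_reward if accumulated == 0 else accumulated
    if base <= 0:                              # a zero reward is already claimed
        return -1, accumulated
    npm = 100 * (max_profit + base) / base
    if npm < threshold:                        # no task gives ample profit
        return -1, accumulated
    return chosen, accumulated + chosen_reward
```

For a user with no accumulated reward, costs {1: 3, 2: 6}, and rewards {1: 10, 2: 12}, task 1 offers the larger profit (7 versus 6). Under the assumed margin formula, the user selects it for a threshold of 150 and drops out for a threshold of 180.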


4.6.5 Incentive Mechanisms 


We cover nine incentive mechanisms, each with distinct characteristics. The
task-reverse-auction (TRA) incentive mechanism has been discussed in the literature
[738, 762, 766]. It is known that the task-reverse-auction incentive mechanism cannot
guarantee that all tasks are completed within a short total operation time in untruthful
bid scenarios [766]. Our incentive mechanism design has the goal of reducing the total
operation time. Thus, we compare their total operation times in Section 4.6.7.2. The
other eight incentive mechanisms are described as follows.
Algorithm 4: Platform-driven greedy algorithm

Input: u_i, S_a, T, where u_i = <s_j, a_i, C_ij, x_i, y_i>
Output: Updated u_i.s_j

if S_a == ∅ then
    u_i.s_j = -1;    // user u_i drops out as no task is available
    return;
end
MP = -∞, CR = -∞;
for s_j in S_a do
    P_ij = R_ij - C_ij;
    if P_ij > MP then
        MP = P_ij;
        CR = R_ij;
        SID = s_j;
    end
end
if u_i.a_i == 0 then
    NPM = 100 × (MP + CR) / CR;
else
    NPM = 100 × (MP + u_i.a_i) / u_i.a_i;
end
if NPM < T then
    u_i.s_j = -1;    // u_i drops out as no task gives ample profit
    return;
end
u_i.a_i = u_i.a_i + CR;
u_i.s_j = SID;
return;

Static Uniform (SU) Incentive Mechanism In the static uniform incentive mechanism
[524], the incentives of all sensing tasks are the same fixed value. The value
$R_{ij}$ is given by Equation 4.14, i.e., $R_{ij}$ is set to the base
reward $BR$.

$R_{ij} = BR$ (4.14)


Dynamic Relative (DR) Incentive Mechanism The incentives change their values
$R_{ij}$ based on the distance from currently unavailable users and the user $u_i$ to the sensing
task $s_j$. This incentive mechanism ranks the currently unavailable users and user $u_i$ by
their distance to the sensing task $s_j$ in increasing order. Then, the value of the incentive
for the sensing task $s_j$ can be calculated by Equation 4.15.


BR 


BRCO- 3) 


(4.15) 
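
The ranking step that feeds Equation 4.15 can be illustrated as follows. The sketch only computes the ranking number $k_i$; it does not implement the reward formula itself. The function name is hypothetical, distances are assumed to be Manhattan distances on the grid, and users at equal distance are assumed to share the better rank.

```python
def rank_of_user(user_pos, busy_user_positions, task_pos):
    """Rank k_i (1 = closest) of a user among itself and the unavailable users."""

    def dist(p):
        # Manhattan distance to the sensing task (assumed metric).
        return abs(p[0] - task_pos[0]) + abs(p[1] - task_pos[1])

    own = dist(user_pos)
    # Count the unavailable users that are strictly closer to the task.
    return 1 + sum(1 for p in busy_user_positions if dist(p) < own)


k = rank_of_user((3, 3), [(0, 0), (5, 5), (4, 4)], (4, 4))
```

In this example the user is at distance 2 from the task, and one unavailable user is strictly closer, so `k` is 2.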


Dynamic/Static User-Centric (DUC/SUC) Incentive Mechanisms First, the center
of user locations is calculated by Equation 4.16. Then we compute the distance
$d_{s,uc}$ from the task $s$ to the user center using Equation 4.17. The value $R_{ij}$ is inversely
proportional to the distance as shown in Equation 4.18.

— Static case: rewards of sensing tasks are computed only once at the beginning of 
each trial. 

— Dynamic case: like the static case, but the calculation repeats whenever a user is 
about to select a sensing task. 


$(x_{uc}, y_{uc}) = \left( \frac{\sum_{i \in U} x_i}{|U|}, \frac{\sum_{i \in U} y_i}{|U|} \right)$ (4.16)
$d_{s,uc} = |x_s - x_{uc}| + |y_s - y_{uc}|$ (4.17)
1ds, 
Rj =BRO -3 BED) (4.18) 
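
The first two steps, the center of the user locations and the Manhattan distance of a task to that center, can be sketched as follows. The helper names and the example coordinates are hypothetical. The mapping from distance to reward in Equation 4.18 is deliberately left out.

```python
def center(points):
    """Mean x and mean y of a non-empty list of (x, y) coordinates."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)


def manhattan(p, q):
    """Manhattan distance between two (x, y) points."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


users = [(0, 0), (4, 0), (2, 6)]
uc = center(users)             # user center
d = manhattan((5, 3), uc)      # distance of a task at (5, 3) to the user center
```

Here `uc` is (2.0, 2.0) and `d` is 4.0. In the static case, `center` is evaluated once per trial; in the dynamic case, it is evaluated again with the current user positions whenever a user is about to select a task. The same two helpers apply unchanged to the task-centric variants when the task coordinates are passed to `center`.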


Dynamic/Static Task-Centric (DTC/STC) Incentive Mechanisms This mechanism
first computes the center of the locations of sensing tasks, i.e. $tc$, by Equation 4.19. It
then calculates the distance $d_{s,tc}$ from the sensing task $s$ to the sensing task center
by Equation 4.20. Finally, it derives the value $R_{ij}$ by Equation 4.21, which is inversely
proportional to the distance.

— Static case: rewards of sensing tasks are computed only once at the beginning of 
each trial. 

- Dynamic case: like the static case, but the calculation repeats whenever a user is 
about to select a sensing task. 
$(x_{tc}, y_{tc}) = \left( \frac{\sum_{s \in S} x_s}{|S|}, \frac{\sum_{s \in S} y_s}{|S|} \right)$ (4.19)
$d_{s,tc} = |x_s - x_{tc}| + |y_s - y_{tc}|$ (4.20)

1 dst 
Ry = BRO - 5555) (4.21) 


Dynamic/Static Pit (DPIT/SPIT) Incentive Mechanisms In this pit-based incentive
mechanism, we use all the users' coordinates to calculate an average distance to the
sensing task $s$ by Equation 4.22. Then, we compute the incentive $R_{ij}$ of the sensing task
$s$ by Equation 4.23.

— Static case: rewards of sensing tasks are computed only once at the beginning of 
each trial. 

— Dynamic case: we need to recalculate the incentives when a user is about to select 
a sensing task. 


$avg_s = \frac{\sum_{i \in U} \left( |x_s - x_i| + |y_s - y_i| \right)}{|U|}$ (4.22)
BR av 
Rj = + E) (4.23) 
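
The average distance of Equation 4.22 is straightforward to compute. The sketch below uses a hypothetical function name and made-up coordinates, and it stops before the reward formula of Equation 4.23.

```python
def average_distance(task, users):
    """Average Manhattan distance from one task to all users."""
    tx, ty = task
    return sum(abs(tx - x) + abs(ty - y) for x, y in users) / len(users)


initial_users = [(0, 0), (4, 0), (2, 6)]
avg_static = average_distance((2, 2), initial_users)

# Dynamic case: the same quantity is recomputed after a user has moved.
moved_users = [(1, 0), (4, 0), (2, 6)]
avg_dynamic = average_distance((2, 2), moved_users)
```

With the initial positions, every user is at distance 4 from the task, so `avg_static` is 4.0. After the first user moves one cell closer, `avg_dynamic` drops to 11/3.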


4.6.6 Incentive Mechanism Simulation 


This section describes the parameters and the processes that have been used in the 
simulations for the performance study of incentive mechanisms. 


4.6.6.1 Parameters 

The parameter tuple for each trial is $\langle B, P, IM \rangle$. After the simulations, the evaluation
metric tuple $\langle t_f, PC \rangle$ is averaged across the total number of simulation trials.
In our simulation, time and money are measured in abstract time units and fiat units,
respectively. The parameters of the experiments are as follows:

— Budget $B$ represents the amount of money that the platform is allowed to use in
a trial. For this experiment, 100 data points were collected in the interval $B \in$
[100.00, 1090.00] with a spacing of 10.00 between data points.

— Area coverage percentage $P$ represents the percentage of the area that requires
sensing data. As with the budget, 100 data points were collected such that $P \in$
[20.0 %, 79.4 %] with a spacing of 0.6 % between data points. This interval represents
a wide range of possible target percentages for pavement crowdsensing. Note
that we round down the area coverage percentage when calculating the number of
tasks.

— The final parameter is the incentive mechanism IM used in the trial. The different 
IM calculate rewards of tasks differently. 
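
The two numeric parameter sweeps can be generated as shown below. This is a sketch under stated assumptions: the grid is taken to be a square of side length $b$, the rounding rule is interpreted as flooring the resulting number of tasks, and the side length of 10 in the example is purely illustrative.

```python
import math

# 100 budget values: 100.00, 110.00, ..., 1090.00
budgets = [100.0 + 10.0 * k for k in range(100)]

# 100 coverage values: 20.0 %, 20.6 %, ..., 79.4 %
coverages = [round(20.0 + 0.6 * k, 1) for k in range(100)]


def number_of_tasks(coverage_percent, side_length):
    """Number of sensing tasks on a b x b grid, rounded down."""
    return math.floor(coverage_percent / 100.0 * side_length ** 2)


tasks_low = number_of_tasks(25.0, 10)    # 25 of 100 cells
tasks_odd = number_of_tasks(20.6, 10)    # 20.6 cells, rounded down to 20
```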
4.6.6.2 Simulation Execution

Given $\langle B, P, IM \rangle$, the construction phase initializes the cells, users, and
sensing tasks in the following order:

— For each cell, any references to users or sensing tasks are cleared.

- For all users, $s_j$, $a_i$, $C_{ij}$, $x_i$, and $y_i$ are initialized; $s_j$ is set to 0. Each user is
placed in a random cell such that no two users share a cell.

- For all sensing tasks, $u_i$, $R_{ij}$, $x_j$, and $y_j$ are initialized; $u_i$ is set to 0. Each sensing
task is placed in a random cell such that no two sensing tasks share a cell.


In the execution phase, available users start their turns by selecting and committing to
a sensing task based on Algorithm 4. Then, the user updates its $s_j$. In turn, the user
information associated with the sensing task $s_j$ is updated to reflect that the user
$u_i$ now performs task $s_j$. If no suitable sensing task is found, then the user drops out of
the trial for all future turns. Unavailable users are the ones who have not dropped out
and spend their turns moving towards their sensing tasks. If the user lands on the
sensing task, then $a_i$ increases by $R_{ij}$. If the user is not yet on the sensing task, then the
user must wait another turn to move closer. In both cases, $C_{ij}$, $x_i$, and $y_i$ are updated to
reflect the current user location.
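
The movement rule for unavailable users (one orthogonal cell per turn, no diagonal moves) can be sketched as follows. The function names are hypothetical, and closing the x-distance before the y-distance is an arbitrary choice made for the example. Any order gives the same number of turns on a grid with uniform travel cost.

```python
def step_towards(pos, target):
    """Move one orthogonal cell towards the target; never diagonally."""
    x, y = pos
    tx, ty = target
    if x != tx:
        x += 1 if tx > x else -1
    elif y != ty:
        y += 1 if ty > y else -1
    return (x, y)


def turns_to_reach(pos, target):
    """Number of turns a user needs until it lands on its sensing task."""
    turns = 0
    while pos != target:
        pos = step_towards(pos, target)
        turns += 1
    return turns


turns = turns_to_reach((0, 0), (2, 3))
```

The number of turns equals the Manhattan distance between user and task, here 5, which is why longer distances translate directly into a longer total operation time.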


4.6.7 Incentive Mechanism Evaluation Results 


Incentive mechanisms are evaluated and compared in three scenarios corresponding 
to low, medium, and high area coverage percentages for pavement crowdsensing with 
different numbers of users. The platform cost is used to order the incentive mechanisms 
based on their performance data, as shown in the following figures. The minimal
budgets shown in the figures are the lowest budgets that realize a 100 % success
rate for the targeted area coverage percentage. That is, any budget higher than
this value also allows the platform to achieve a 100 % success rate for the targeted area
coverage.


4.6.7.1 Platform Cost Comparison 

In this section, we discuss the comparison of incentive mechanisms in terms of the 

platform cost. 

— Given 25 % area coverage, Figure 4.38 shows that the SU and DR incentive mecha- 
nisms have the lowest platform costs when the platform has 3 users and 15 users, 
respectively. Apart from this, the static and dynamic pit incentive mechanisms 
consistently rank among the top three incentive mechanisms in all scenarios. 

— Given 50 % area coverage, Figure 4.39 shows that the static and dynamic pit incentive
mechanisms remain among the best performers in terms of platform cost in all
scenarios. Although the DTC incentive mechanism achieves the lowest platform cost
at 50 % area coverage with 45 users, this single observation does not contradict the
overall trend.

— Given 75 % area coverage, Figure 4.40 shows that SPIT and DPIT always have the 
lowest platform cost regardless of how many users the platform has. 
Fig. 4.38: Incentive mechanism comparison: 25 % area coverage.


Based on the observations described above, we can conclude that SPIT and DPIT are 
the two incentive mechanisms with the lowest platform cost. 


4.6.7.2 Total Operation Time Comparison 

In this subsection, we compare the incentive mechanisms in terms of
the total operation time. As Figs. 4.38, 4.39, and 4.40 show, the total operation time of
the task-reverse-auction (TRA) incentive mechanism is nearly twice the total operation
time of our mechanisms. Additionally, the total operation time of the TRA incentive mechanism
becomes longer as the number of participating users increases, while the total operation
times of our incentive mechanisms decrease in the same situation. This result
shows that our incentive mechanisms have a much lower total operation time than the
TRA incentive mechanism.
Fig. 4.39: Incentive mechanism comparison: 50 % area coverage.
4.6.8 Machine Learning Augmentation


Based on computer vision, crowdsensing and machine learning can be applied to
pavement distress monitoring. For example, we have used a state-of-the-art machine
learning model for detecting pavement damage based on images captured by an
Android phone camera and classifying it into eight types with a corresponding
confidence [738]. The types include (i) Linear crack, longitudinal, wheel mark part, (ii) Linear
crack, longitudinal, construction joint part, (iii) Linear crack, lateral, equal interval,
(iv) Linear crack, lateral, construction joint part, (v) Alligator crack, (vi) Rutting, bump,
pothole, separation, (vii) White line blur, and (viii) Crosswalk blur. We have chosen
this machine learning model because it “achieved recalls and precisions greater than
75 % with an inference time of 1.5s on a smartphone.” [738]

Moreover, machine learning has been used for incentive mechanisms in embedded
crowdsensing applications. For example, a neural network and a clustering algorithm have
been applied for user grouping in [427], and the resulting incentive mechanisms can
reduce the social cost, overpayment ratio, and grouping time. In the following sections,
we discuss general considerations of machine learning augmentation, supervised
learning, and unsupervised learning for incentive mechanisms.


4.6.8.1 General Considerations 
We have witnessed great strides in the development of machine learning and its
applications in recent years. Work on image classification, text generation,
language translation, and generative adversarial networks has produced results that
could only be described as magic to the untrained eye. Furthermore, harnessing
the full potential of these techniques, including subsequent derivatives, will
be the aim of research for the foreseeable future. Such prospects motivate
the incorporation of different machine learning algorithms within the framework of
crowdsensing, and our discussion is no exception. The crowdsensing scheme can be
augmented using both supervised and unsupervised learning.

— Supervised learning focuses on mapping input data to target classes. Typically 
the input data will be sampled from a dataset where particular data points must 
be categorized. One common supervised learning task is image recognition. For 
example, one may use a convolutional neural network that receives input data in 
the form of road pictures and produces output data in a string distinguishing the 
road condition [738]. 

— Unsupervised learning focuses on clustering datasets such that embedded classes
may be revealed. These class embeddings may reveal subsets of the data and help
highlight underlying relationships. One common unsupervised
learning task is dimensionality reduction. In this case, we may consider a dataset
with a large number of features. Using a clustering algorithm, such as spectral
clustering, we may be able to reduce the number of features required to characterize
data points uniquely. In other words, we may only need a proper subset of a
dataset's features to represent the grouping of a data point. These transformations
may lead to a lossless decomposition of our data. Thus, we can reduce memory stress
for supervised learning tasks.


4.6.8.2 Supervised Learning for Predicted Budget 

One apparent shortcoming shared by all works covered in this contribution is the lack
of a predicted budget. Having incentive mechanisms work under a limited budget is only
reasonable if the budget has been methodically selected. This naturally poses a supervised
learning problem. Either a classification of the simulation parameters
or a forecast of user costs can be used to determine a predicted budget. In the first
case, the input data would include basic simulation parameters, such as the number 
of users and percentage area coverage, and the output would be the predicted budget. 
In the second case, the simulation would provide a seed state, including coordinates 
of users and sensing tasks, as input and calculate the final board statistics in terms 
of overall operation costs. The overall operation costs would be correlated with the 
predicted budget. Previous simulations would act as the data needed to construct these 
models in either case. 


4.6.8.3 Unsupervised Learning for Incentive Mechanisms 

Although the different incentive mechanisms studied in this contribution showed various
levels of effectiveness in finishing the sensing tasks, a question of interest in
crowdsensing applications is how their performance might differ if the sensing tasks
follow a non-uniform or random distribution. One may model the environment of
the crowdsensing scheme as scattered normal distributions where sensing tasks may
cluster in different neighborhoods. This scenario is realistic in rural areas, where
communities may be far apart but dense around some centroid. The covered incentive
mechanisms require a new component to scale incentives across clustered communities
effectively. As the unsupervised learning task, a clustering algorithm could locate the
centroids of sensing task clusters and calculate rewards relative to these neighborhoods.
In this case, the incentive mechanisms could scale to any size of environment given
sufficient resources.
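
As an illustration of how such a component could locate neighborhood centroids, the sketch below applies plain k-means (Lloyd's algorithm) to task coordinates. The choice of k-means, the function name, the initial centroids, and the example coordinates are all assumptions made for this sketch; the text above does not prescribe a particular clustering algorithm.

```python
def kmeans(points, centroids, iterations=20):
    """Lloyd's algorithm on (x, y) points, starting from the given centroids."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            # Assign each point to the nearest centroid (squared Euclidean distance).
            idx = min(
                range(len(centroids)),
                key=lambda c: (p[0] - centroids[c][0]) ** 2 + (p[1] - centroids[c][1]) ** 2,
            )
            clusters[idx].append(p)
        new_centroids = []
        for old, members in zip(centroids, clusters):
            if members:
                new_centroids.append(
                    (sum(x for x, _ in members) / len(members),
                     sum(y for _, y in members) / len(members))
                )
            else:
                new_centroids.append(old)   # keep the centroid of an empty cluster
        if new_centroids == centroids:      # assignments are stable
            break
        centroids = new_centroids
    return centroids, clusters


tasks = [(1, 1), (2, 1), (1, 2), (10, 10), (11, 10), (10, 11)]
centroids, clusters = kmeans(tasks, [(0, 0), (12, 12)])
```

For these two well-separated groups, the algorithm converges to one centroid per neighborhood. A task-centric or pit-based reward could then be computed relative to the centroid of the neighborhood a task belongs to instead of a single global center.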


4.6.9 Conclusion and Future Work 


In this contribution, we proposed eight incentive mechanisms based on a platform-driven
greedy algorithm to help the crowdsensing platform motivate users to collect
pavement condition data. Since our incentive mechanisms allow users to select the
sensing tasks based on a platform-driven greedy algorithm before they start to collect
the data, they can avoid the cost explosion problem observed in the data-reverse-auction
incentive mechanisms. From the simulation results, we find that SPIT and DPIT
are the incentive mechanisms that have the lowest platform cost. Compared with the
task-reverse-auction incentive mechanism, our incentive mechanisms reduce the total
operation time by about half. Lastly, we discussed machine learning augmentations for
embedded crowdsensing applications and different incentive mechanisms. Our future
research includes large-scale simulations and real-life experiments by extending our
prototype pavement crowdsensing system.
5.1 Capacity Analysis of IoT Networks in the Unlicensed Spectrum


Stefan Böcker 
Christian Arendt 
Christian Wietfeld 


Abstract: The ongoing digitalization and the steadily increasing number of distributed
sensor devices and Internet-of-Things (IoT) systems imply a massive increase in
subscribers. At the same time, the amount of available frequency spectrum resources
remains static. In this respect, current 5G networks are already aiming for large-scale
connectivity with an ambitious node density of 1,000,000 devices per square kilometer
in the area of massive Machine Type Communication (mMTC). A huge number of potential
technology solutions are available, but a comprehensive networking solution
based on one technology seems unlikely. Besides typical cellular IoT technologies, these
challenging 5G mMTC requirements are also addressed by a growing number of unlicensed
technologies that enable simple, cost-effective network operation independent
of licensed operators.


In this context, the potential of Low-Power Wide Area Network (LPWAN) technologies
as an additional technology option in unlicensed frequency bands is analyzed.
Specifically, this work aims to analyze the suitability of LoRaWAN to contribute to given
5G requirements for specific mMTC applications in large-scale deployments. The performance
evaluation illustrates that LoRaWAN is attractive due to high communication
ranges of up to multiple kilometers, enabling high coverage even with a small number
of cells. The evaluation also finds that the technology has a high potential to contribute
to 5G mMTC application areas, especially for non-time-critical sensor applications.


To further increase the reliability of LPWAN systems, especially for critical services, 
different approaches to increase spectral efficiency are discussed. In addition to purely 
scheduling-based approaches, a data-driven analysis of the spectral power density 
to predict and avoid technology-independent interferences is presented. This is used 
to increase the robustness of LPWAN systems by centrally deriving communication 
profiles that address and bypass the predicted interference characteristics. 


Apart from intelligently scheduling data transmission, another way of increasing
efficiency is to reduce the amount of data that has to be transferred in the first place.
Nowadays, initial generations of connected IoT devices and applications enabled by
Cellular-IoT (CIoT) and LPWAN technologies are deliberately kept simple and based on
equidistant, regular communication intervals. By contrast, we illustrate an Artificial
Intelligence (AI)-based model-predictive communication approach taking advantage of
knowledge about the underlying data of a sensor system. An AutoRegressive Integrated
Moving Average (ARIMA) model is used to depict the behavior of an application's
sensor data, leaving only values deviating from the model to be transmitted.

Fig. 5.1: Achieving the 5G scalability targets for IoT environments


5.1.1 Introduction 


Wireless connectivity has become a ubiquitous part of daily life. While wireless networks 
were originally developed to connect people, cellular networks have evolved to enable 
Machine-to-Machine (M2M) communication. In a wide variety of application areas, 
such as smart cities, energy systems, or production and logistics, devices are linked 
to each other to enable fully autonomous operation without human intervention. In 
this context, the 5G specification defines an mMTC requirement profile that aims for 
an ambitious scalability target of 1 million subscribers per square kilometer, while 
maintaining a maximum latency of 10 s (see Figure 5.1). At the same time, this scalability
target is linked to a boundary condition that assumes a pre-defined Poisson arrival
process as the traffic pattern for non-full-buffer systems with a payload of 32 B.

While Section 4.3 presents the performance evaluation of the NB-IoT technology 
as a current 3GPP solution to address 5G mMTC requirements, this contribution covers 
the research challenge to identify complementary technologies operated in unlicensed 
frequency bands to contribute to tight 5G mMTC requirement profiles. To this end, this 
work first discusses potentials and limitations of the Long-Range Wide-Area Network 
(LoRaWAN) technology, as a representative of LPWAN solutions, in order to subsequently
introduce optimizations for further performance enhancements of unlicensed
technology solutions with specific respect to mMTC applications.


5.1.2 Opportunities and Challenges of the Licensed and the Unlicensed Spectrum for
IoT Environments


First, this section introduces the necessary LoRaWAN fundamentals from which the
opportunities and challenges are derived. In addition, the impact of regulatory policies
on the performance of the LoRaWAN technology is discussed.

LoRaWAN is an LPWAN specification for wireless battery-powered systems in a
regional, national, or global range. It is based on the LoRa modulation technique and
mainly operated in the Short-Range Device (SRD) band at around 868 MHz in Europe
and 915 MHz in the US. LoRaWAN enables wide-range communication even in urban
scenarios and provides very good deep indoor penetration [203, 304]. The choice
of the spreading factor (SF) permits a trade-off between efficient and very robust
communication, whereby the data rates vary from 0.25 kbit/s (SF=12) to 5.5 kbit/s (SF=7).

Because LoRaWAN is operated in unlicensed frequency bands, the channel access 
must comply with regulatory frequency band requirements that ensure that all par- 
ticipants have equal access to frequency resources. For the underlying SRD band, the 
European Commission in cooperation with ETSI allows mitigation techniques such 
as Listen Before Talk (LBT), detect and avoid (DAA), and duty-cycle limitations [192], 
whereby LoRaWAN relies on a fairly simple pure ALOHA channel access and implements
the duty-cycle limitations to meet regulatory ETSI requirements. Thus, peak
physical data rates are further reduced by more than 99 % due to the regulatory
impact of a given duty cycle of 1 % and, to a minor extent, the MAC overhead. The
resulting average throughput ranges from 1.5 to 48 bit/s (as shown in Figure 5.2).

Consequently, throughput limitations are mostly driven by the idle time (time
off) following the transmission time per packet (time-on-air), which is required to
meet duty-cycle limits [446]. The LoRaWAN specification defines three mandatory
channels: 868.1 MHz, 868.3 MHz, and 868.5 MHz; additional resources are optional (see
Figure 5.3). To reduce interference, channels are cycled in a pseudo-random approach.
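
The relation between time-on-air, idle time, and duty cycle can be made explicit with a small calculation. The sketch below assumes the simple per-packet interpretation in which a transmission of duration T_on must be followed by an idle time of at least T_on * (1/d - 1) for a duty cycle d. The function names are hypothetical, and MAC overhead is ignored.

```python
def min_time_off(time_on_air_s, duty_cycle):
    """Minimum idle time after a packet so that the duty cycle is respected."""
    return time_on_air_s * (1.0 / duty_cycle - 1.0)


def duty_cycle_limited_rate(peak_bit_per_s, duty_cycle):
    """Upper bound on the average rate when transmitting only a fraction of the time."""
    return peak_bit_per_s * duty_cycle


off_time = min_time_off(1.0, 0.01)                   # idle time after 1 s on air at 1 %
rate_sf7 = duty_cycle_limited_rate(5500.0, 0.01)     # SF=7 peak rate of 5.5 kbit/s
rate_sf12 = duty_cycle_limited_rate(250.0, 0.01)     # SF=12 peak rate of 0.25 kbit/s
```

At a duty cycle of 1 %, a packet with one second of time-on-air forces about 99 s of idle time, and the peak rates of 5.5 kbit/s and 0.25 kbit/s shrink to upper bounds of 55 bit/s and 2.5 bit/s. The average throughput of 1.5 to 48 bit/s stated above lies below these bounds because it additionally accounts for protocol overhead.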

As shown in Figure 5.3, the duty-cycle constraints apply to each SRD sub-band and
may vary between different sub-bands; e.g., a dedicated LoRaWAN downlink channel
is deployed at 869.525 MHz with a duty cycle of 10 % and an additional limitation to
data rate class 0 (DR0).


5.1.3 Capacity Limits of LPWAN Technologies in Unlicensed Band Operation 


The determination of capacity limits is based on a performance evaluation derived from
an analytical model [446] that has been enhanced fundamentally for the underlying
work. The resulting analytical model enables demand-based derivation of key performance
parameters, such as data rate and coverage area. The main extension aims at
deriving latency bounds for different scalability scenarios and large-scale deployments,
enabling the simultaneous determination of service guarantees.

Because downlink communication between a LoRaWAN gateway and distributed
LoRaWAN nodes is not interfered with by uplink communication, performance limits
can be easily determined by evaluating the maximum capacity of a single LoRaWAN
link. Considering the mandatory duty-cycle constraint (see Section 5.1.2), this results in
overall limited downlink performance margins. Based on the assumption of Class A
LoRaWAN nodes, Figure 5.4 depicts the average downlink throughput of a LoRaWAN
network. As illustrated, the LoRaWAN downlink for Class A devices is based on two
consecutive downlink receive windows. Following an uplink message, a Class A end
device opens a first receive window (RX1), typically one second after the end of the
uplink transmission and on the same frequency channel as the uplink message.
The second receive window (RX2) is typically opened 2 s
after the uplink transmission and uses a dedicated downlink channel at 869.525 MHz
with a mandatory duty cycle of 10 % and a limitation to data rate class DR0. This
results in a low average downlink throughput of about 31.2 bit/s, which excludes a
large number of downlink-intensive and safety-critical applications, such as update or
upgrade functions.
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Fig. 5.4: System throughput (downlink) utilizing maximum duty-cycle capabilities ©[2019] IEEE. 
Reprinted, with permission, from [93]. 


In the uplink communication direction, the pure ALOHA channel-access scheme is
implemented. In this context, parallel uplink transmissions with different spreading
factors are orthogonal to each other, permitting each data-rate class to be modeled as
an independent ALOHA process. The maximum system throughput of each data-rate class is
closely related to the number of devices and can be determined using the known ALOHA
model equation $S = G \cdot e^{-2G}$, where $S$ is the normalized channel throughput
and $G$ the channel traffic [8]. The integration of the capture effect takes into
account that two parallel transmissions with a Received Signal Strength Indicator
(RSSI) delta greater than 6 dB for co-channel rejection [592] do not interfere with
each other. In this case, only one packet is dropped, and the packet with higher
received power is assumed to be decoded successfully, resulting in increased
scalability. ALOHA equations can
be adapted to S = ee - (1 + ef) [140]. In order to allow the derivation of guaranteed 
performance in addition to maximum uplink system throughput or scalability, the
existing model is extended to incorporate the derivation of latency bounds, whereby
latency is defined as the sum of Time-on-Air ($ToA$) and the time off ($T_{off}$)
enforced by regulatory duty-cycle restrictions. Consequently, every interfered
transmission increases the latency by $ToA + T_{off}$. Therefore, the mean latency
$T_{DR}$ can be defined based on the packet collisions per transmission at channel
traffic $G$, as given in the following equation:


$$T_{DR} = \frac{1}{e^{-2G}} \cdot (ToA + T_{off}) - T_{off} = (e^{2G} - 1) \cdot (ToA + T_{off}) + ToA \qquad (5.1)$$


Equation 5.1 is enhanced to consider the 99 %-quantile $T_{99\%}$ of the defined 5G mMTC
latency requirements:


$$T_{99\%} = \log_{1 - e^{-2G}}(0.01) \cdot (ToA + T_{off}) + ToA \qquad (5.2)$$
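
As a hedged illustration, the model can be evaluated numerically. The sketch below assumes the
readings S = G * exp(-2G) for the pure ALOHA throughput, T_DR = (exp(2G) - 1) * (ToA + T_off) + ToA
for Equation 5.1, and T_99% = log_{1 - exp(-2G)}(0.01) * (ToA + T_off) + ToA for Equation 5.2. The
function names and the example values are hypothetical and do not reproduce the configuration
behind Figure 5.5.

```python
import math


def aloha_throughput(G: float) -> float:
    """Normalized pure ALOHA throughput S = G * exp(-2G)."""
    return G * math.exp(-2.0 * G)


def mean_latency(G: float, toa: float, t_off: float) -> float:
    """Mean latency as read from Equation 5.1.

    exp(2G) - 1 is the expected number of collided attempts before a success,
    each costing one time-on-air plus the enforced time off.
    """
    return (math.exp(2.0 * G) - 1.0) * (toa + t_off) + toa


def latency_quantile(G: float, toa: float, t_off: float,
                     miss_probability: float = 0.01) -> float:
    """Latency bound as read from Equation 5.2 (99 %-quantile by default)."""
    p_collision = 1.0 - math.exp(-2.0 * G)
    if p_collision <= 0.0:
        return toa  # no channel traffic, hence no retransmissions
    # Logarithm of miss_probability to the base p_collision.
    attempts = math.log(miss_probability) / math.log(p_collision)
    return attempts * (toa + t_off) + toa


if __name__ == "__main__":
    toa, t_off = 1.0, 99.0  # hypothetical values for a 1 % duty cycle
    for G in (0.1, 0.25, 0.5):
        print(G, aloha_throughput(G), mean_latency(G, toa, t_off),
              latency_quantile(G, toa, t_off))
```

The throughput peaks at G = 0.5, whereas both latency figures keep growing with G, which is the
trade-off between scalability and service guarantees discussed above.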


The derived model can be modified and configured depending on the desired parameter
scenarios. Figure 5.5 illustrates the results for an exemplary configuration of a 32 B
payload, 3 channels, and the maximum duty cycle. Without considering the capture
effect, a maximum throughput of approximately 3.3 kbit/s can be achieved with a fleet
size of about 900 subscribers. This can be further increased by about 50 % when the
capture effect is additionally considered, resulting in a maximum throughput of about
4.7 kbit/s for 1350 uniformly distributed subscribers. However, taking into account the
limiting data-rate class DR0, this is simultaneously accompanied by an increased
average latency of 400 s, which corresponds to an increase of about 12.5 %.


Fig. 5.5: System throughput (uplink) and latency utilizing maximum duty-cycle capabilities ©[2019]
IEEE. Reprinted, with permission, from [93].


5.1.3.1 LoRaWAN Contribution to 5G

LoRaWAN technology is emerging as a very good solution in the unlicensed spectrum band
to support 5G mMTC targets. Although LoRaWAN supports a loss of only 151 dB rather than
the required 164 dB, it can cover the targeted area of one square kilometer, which is
defined for the 5G mMTC connection density target. At any rate, it has a very good
range and deep indoor availability even for urban areas [304]. Furthermore, Figure 5.6
illustrates the impact of various LoRaWAN parameter configurations on maximum
scalability in view of 5G mMTC connection density and latency requirements (see
Section 5.1.1). It can be shown that the 5G mMTC parameter set results in a significant
contribution of 10 % for three 125 kHz frequency channels, which can be increased up to
25 % when eight channels are considered to cover the 5G mMTC target of one million
devices per square kilometer. However, these results are obtained without considering
the 5G mMTC latency requirement of 10 s. If the latency requirement is taken into
account to ensure a certain quality of service, scalability is reduced by 25 % for a
50 %-quantile or by up to 70 % under consideration of a 99 %-quantile.

When deviating from the 5G mMTC traffic pattern and considering other payload sizes, it
can be seen that this factor has only a minor impact on scalability. By contrast, the
variation of the transmission interval has a fundamental effect. In the case of a low
transmission rate of only one message every 12 hours, more than 50 % of the overall
connection density of one million devices per square kilometer can be met without
consideration of a latency requirement (99 %-quantile). By contrast, almost 20 % can be
met when taking into account a latency requirement. So far, the results described have
focused on the assessment of unacknowledged traffic in the uplink direction. When ACK
packets in the downlink direction are included, the downlink proves to constrain the
scalability of the LoRaWAN network significantly. Even for a transmission interval of
12 hours, the scalability decreases to about 14 250 subscribers per square kilometer,
which corresponds to a reduction of 97 % and is consistent with the limitations of the
LoRaWAN downlink. Overall, depending on the application scenario and configuration, a
very significant contribution of LoRaWAN technology to the 5G mMTC goals can be
deduced, though the application field should be limited to non-time-critical sensor
applications.

Fig. 5.6: Impact of various LoRaWAN parameter configurations on maximum scalability considering
5G mMTC connection density and latency requirements ©[2019] IEEE. Reprinted, with permission,


5.1.4 Data-Driven Capacity Improvements


In this section, optimization methods are presented that further enhance the perfor- 
mance of the previously discussed LPWAN technologies in unlicensed frequency bands. 


5.1.4.1 Dynamic Spectrum Management to Improve Scalability of Time-Critical 
Sensor Applications 

Due to their simple and cost-effective technical viability, a steadily increasing
number of LPWANs are operated in unlicensed frequency bands. For each user, this leads
to a large number of possible interference effects caused by a wide variety of
technologies, each using different channel access methods. There is no central
coordination as in licensed mobile radio frequencies. Despite mandatory interference
mitigation techniques, quality of service in terms of availability, latency, etc.
cannot be guaranteed due to uncoordinated channel access in unlicensed frequency
ranges. In order to tackle these challenges, a data-driven spectrum management
procedure is proposed (see Figure 5.7). This approach relies on SDR-based spectrum
sensing to gather information on channel occupation, which is used to predict future
spectral utilization.

The predicted activity profiles are used for spectrum management in order to
intelligently schedule future transmissions. Three scheduling approaches, namely
Restricted Access Window, Weighted Restricted Access Window, and Coordinated Restricted
Access Window, have been developed and evaluated. An overview of these approaches is
shown in Figure 5.8; a brief description of each approach is given below. For
evaluation, an externally defined Key Performance Indicator (KPI) is required, which is
represented by the expected latency in this work. As LoRaWAN is the technology of
choice, ALOHA channel access is used as the basis and a duty cycle has to be observed.
The expected latency is modeled as described in Section 5.1.3.

Fig. 5.8: Concept of the dynamic spectrum-management approach.

The three optimization methods Restricted Access Window, Weighted Restricted 
Access Window, and Coordinated Restricted Access Window are described below. 


Restricted Access Window (RAW) The basic idea and the name are derived from IEEE
802.11ah technology. The time axis is first divided into no-go areas and random-access
areas. The channel is accessed randomly in areas below the threshold, and access is
avoided in the no-go areas. Restricted Access Window is the simplest of the developed
methods. It is a coarse-grained method, which means that the gateway does not determine
the transmission time for each subscriber; rather, the gateway only announces the
suitable transmission times through beacons and the subscribers randomly select a
transmission time.


Fig. 5.9: Comparison of the three scheduling approaches with regard to potential latency reduction.


Weighted Restricted Access Window In this method, the Restricted Access Window method
is extended by a weighted access probability. This means that the further the activity
level of a time range lies below the tolerance limit, the higher its chance of being
selected. In this work, a quadratic function is used as the weighting, and the weight
is calculated for each time point $t$ using Equation 5.3.


$$W_t = (A_{Tolerance} - A_t)^2 \qquad (5.3)$$


This method is also coarse-grained: the gateway transmits suitable transmission time
points together with their weights via beacons. The participants finally choose a
transmission time point by weighted random selection.
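
The weighted selection can be sketched in a few lines of Python. The example reads Equation 5.3
as W_t = (A_Tolerance - A_t)^2 and assumes, in line with the RAW no-go areas, that time points
whose activity reaches or exceeds the tolerance receive a weight of zero. The function names and
the activity values are hypothetical.

```python
import random


def wraw_weights(activity, tolerance):
    """Quadratic weight per time point; no-go time points get weight 0."""
    return [(tolerance - a) ** 2 if a < tolerance else 0.0 for a in activity]


def pick_time_point(activity, tolerance, rng=random):
    """Draw one time-point index with probability proportional to its weight."""
    weights = wraw_weights(activity, tolerance)
    if not any(weights):
        raise ValueError("no time point lies below the tolerance limit")
    return rng.choices(range(len(activity)), weights=weights, k=1)[0]


if __name__ == "__main__":
    predicted_activity = [0.1, 0.5, 0.9, 0.3]  # hypothetical activity profile
    rng = random.Random(42)                    # seeded for repeatability
    draws = [pick_time_point(predicted_activity, 0.6, rng) for _ in range(1000)]
    print([draws.count(i) for i in range(len(predicted_activity))])
```

In this sketch the gateway side corresponds to `wraw_weights` (the values sent in beacons), and
each participant runs `pick_time_point` locally.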


Coordinated Restricted Access Window (CRAW) This procedure finds the optimal
transmission time for each participant by assigning the time with the minimum activity
to the first participant. Then the activity profile is updated with the activity caused
by the first participant. This procedure is repeated for the other participants.

The cyclic steps of the procedure are as follows:

1. finding the minimum activity in the profile and assigning this time to the participant;
2. updating the profile in view of the activity change at the assigned time point; and
3. repeating the procedure for the next participant.


Compared with the previous methods, CRAW is a fine-grained scheduling method. This
means that the gateway must communicate the transmission time to each subscriber
individually. Therefore, this method requires many resources. In CRAW, the influence of
each participant on the activity is taken into account.
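
The three cyclic steps can be sketched as a greedy loop. The Python example below assumes that
every assigned transmission adds a fixed increment `load_per_tx` to the activity of its time
point; this increment, the function name, and the profile values are hypothetical and only serve
to illustrate the procedure.

```python
def craw_schedule(activity, n_participants, load_per_tx):
    """Greedy CRAW-style assignment of one time point per participant.

    Returns the assigned time-point indices and the updated activity profile.
    """
    profile = list(activity)  # work on a copy of the predicted profile
    assignment = []
    for _ in range(n_participants):
        # Step 1: find the time point with the minimum activity.
        slot = min(range(len(profile)), key=profile.__getitem__)
        assignment.append(slot)
        # Step 2: account for the activity added by this participant.
        profile[slot] += load_per_tx
        # Step 3: continue with the next participant.
    return assignment, profile


if __name__ == "__main__":
    slots, updated = craw_schedule([0.2, 0.0, 0.1], n_participants=3, load_per_tx=0.15)
    print(slots, updated)
```

Because the profile is updated after every assignment, later participants are steered away from
time points that earlier participants have already filled, which is what distinguishes CRAW from
the two coarse-grained methods.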


Comparison of Scheduling Performance In this section, the three developed methods are
evaluated using the presented scenarios. The predicted daily profiles from the 868 MHz
study are used as a basis. The results are shown in Figure 5.9, with the daily profiles
used represented at the top.

Fig. 5.10: Model-predictive communication in Internet of Things environments ©[2020] IEEE.
Reprinted, with permission, from [30].


It can be seen that the expected latency can be significantly reduced by using more
intelligent scheduling mechanisms, which avoid times with high channel activity. Simply
avoiding regions with a mean latency above a given threshold with the RAW approach sets
the baseline potential, which can be improved by weighting to reduce the mean latency
below 1 s in this study. Providing more intelligence for the scheduling method with
CRAW pushes all observed latencies below 1 s.

However, this approach generates an increased computational effort, which has to be
taken into account as a trade-off.


5.1.4.2 Data-Driven Model-Predictive Communication 

In this section, we propose a data-driven approach to reduce communication efforts 
by leveraging knowledge about the underlying sensor data in IoT systems. In order to 
keep devices simple, the data transmission of IoT devices typically follows a regular 
pattern of equidistant time intervals. This leaves a high potential for optimizing the 
efficiency of spectral resource usage. Therefore, this study proposes a model-predictive 
communication framework that leverages knowledge about the underlying sensor 
data and allows IoT devices to rate the value of observations in order to decrease 
communication effort and free up spectral resources for other parties. This approach 
potentially increases the number of devices considered in the scheduling of resources 
in licensed frequency bands and reduces the likelihood of interference in unlicensed 
bands. Figure 5.10 depicts the concept of this work. The underlying approach is not to
transmit every measurement, but only those that deviate significantly from a
predetermined model. Therefore, two time series forecasting mechanisms are examined in
this study to generate such a model of a temperature sensor setup. Because these
methods rely solely on past data, the use case may be readily changed. This approach
has been evaluated using a dataset originating from an environmental indoor sensor
located in a residential area in Dortmund, Germany. From the 1st of January 2019 until
the 19th of November 2019, the system gathered temperature, humidity, and CO2
concentration at intervals of around 5 minutes. To minimize model complexity, the
dataset was resampled to 30-minute time steps in this work. The raw data can be
accessed via [29]. The forecasting algorithms used in this work are an
autoregression-based approach and a neural network approach, which are described in the
following sections. Both models take advantage of a decomposition approach, which is
discussed below.
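
The transmission rule itself is simple and can be sketched independently of the forecasting
model. The Python example below assumes that sensor and receiver share the same predicted
series, so that only measurements deviating from the prediction by more than a tolerance have to
be sent. The function names and the temperature values are hypothetical.

```python
def communication_events(measured, predicted, tolerance):
    """Indices of measurements that must be transmitted.

    A measurement is sent only if it deviates from the shared model
    prediction by more than the tolerance.
    """
    return [i for i, (m, p) in enumerate(zip(measured, predicted))
            if abs(m - p) > tolerance]


def communication_reduction(measured, predicted, tolerance):
    """Share of transmissions saved compared with sending every measurement."""
    events = communication_events(measured, predicted, tolerance)
    return 1.0 - len(events) / len(measured)


if __name__ == "__main__":
    measured = [20.0, 20.4, 21.2, 22.0]   # hypothetical temperatures in °C
    predicted = [20.0, 20.0, 20.5, 22.1]  # hypothetical model output in °C
    print(communication_events(measured, predicted, tolerance=0.5))
    print(communication_reduction(measured, predicted, tolerance=0.5))
```

A larger tolerance can only remove events from the list, which matches the expectation that less
demanding applications save more communication effort.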


Seasonal and Trend Decomposition Using LOESS (STL) In this work, the Seasonal 
and Trend decomposition using LOESS (STL) method [540] is used to extract typical 
properties of the underlying data, such as a daily profile for temperature data. The 
algorithm consists of two loops: the inner loop uses LOcally wEighted Scatterplot 
Smoothing (LOESS) [141] to extract the seasonal and trend components and the outer 
loop is used to minimize the impact of outliers by computing robustness coefficients. 
An example decomposition of a daily temperature profile from the dataset is shown in
Figure 5.11.

Fig. 5.11: Decomposition of measured temperature data from 18th of August until 21st of August
using STL ©[2020] IEEE. Reprinted, with permission, from [30].

The example shows strong daily seasonality and a local weather trend. Additionally, a
significant remainder is present that could not be related to either the trend or
seasonality component. The decomposition can be used to aid forecast algorithms by
subtracting the seasonal component before applying the prediction, and adding it back
to the resulting time series afterwards. As the seasonal component typically changes
slowly over time, it is possible to reduce the complexity of the analyzed time series
in order to decrease prediction errors.
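
The idea of subtracting the seasonal component, forecasting, and adding it back can be
illustrated with a deliberately simplified sketch. Note that the example does not implement STL:
the LOESS smoothing is swapped for a plain average per position in the daily cycle, and the
forecast of the deseasonalized series is a naive last-value forecast. The function names are
hypothetical, and the series is assumed to contain at least one full period.

```python
def seasonal_profile(series, period):
    """Average value per position in the cycle, centred to zero mean.

    Simplification: a plain average replaces the LOESS smoothing of STL.
    """
    sums = [0.0] * period
    counts = [0] * period
    for i, value in enumerate(series):
        sums[i % period] += value
        counts[i % period] += 1
    means = [s / c for s, c in zip(sums, counts)]
    offset = sum(means) / period
    return [m - offset for m in means]


def forecast_with_seasonality(series, period, horizon):
    """Deseasonalize, forecast the remainder naively, and re-add the season."""
    season = seasonal_profile(series, period)
    deseasonalized = [v - season[i % period] for i, v in enumerate(series)]
    level = deseasonalized[-1]  # naive last-value forecast
    start = len(series)
    return [level + season[(start + h) % period] for h in range(horizon)]


if __name__ == "__main__":
    # Three hypothetical "days" with four samples each.
    history = [10.0, 12.0, 14.0, 12.0] * 3
    print(forecast_with_seasonality(history, period=4, horizon=4))
```

For the 30-minute resampling described above, one daily cycle would correspond to a period of 48
samples.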


AutoRegressive Integrated Moving Average (ARIMA) The ARIMA algorithm is a
state-of-the-art time series forecasting method and one of the most widely used. It is
composed of three components: the autoregressive part AR(p), entailing the past values
of the original series; the integrated part I(d), related to differencing in order to
make the time series stationary; and the moving average component MA(q), capturing the
model errors. The parameter set (p,d,q) defines the order of each component and
therefore indicates the specific ARIMA model in use. In detail, p indicates the number
of considered past values, d is the degree of differencing, and q specifies the number
of considered previous error terms. To simplify the application in this work, the
forecast package for the statistical programming language R is used. The package
contains an implementation of the ARIMA algorithm that allows automatic parameter set
selection for every model realization [294], where a unit root test checking for
stationarity is used to specify d, while p and q are found by minimizing Akaike's
Information Criterion (AIC). A prediction for the 21st of August 2019 using
ARIMA(0,1,0) based on a training period of three previous days, together with the 95 %
and 80 % prediction intervals as well as the actual measurement as test data, is
depicted in Figure 5.12.

Fig. 5.12: Forecast for one sensor using STL + ARIMA(0,1,0) for the 21st of August 2019 with 95 % and
80 % prediction intervals ©[2020] IEEE. Reprinted, with permission, from [30].

It is evident that a good match is achieved between the predicted and the measured
temperature, with most of the measured temperature values lying inside the 80 %
prediction interval.


Long Short-Term Memory (LSTM) As a variant of Recurrent Neural Networks (RNN) developed
by Hochreiter et al. [277], LSTM networks have an improved ability to learn long-term
dependencies in a dataset, which makes them appropriate for time series prediction
tasks. The main structure of LSTM networks consists of concatenated cells that are
linked together by constant cell states and the input flow. An input gate is used to
regulate the influence of the cell's input, while a forget gate filters previous cell
states.

The implementation used in this work is based on the well-known Keras Python framework
with the Theano library as a backend. In order to keep the training duration manageable
and avoid decreasing the model accuracy by deeper networks as stated in [358], a single
network layer is used. This layer contains 50 LSTM cells; however, the influence of the
number of cells was seen to be minor.

Figure 5.13 compares LSTM and ARIMA performance for an exemplary prediction based on
the data of the 21st of August 2019 with the three previous days as an input in terms
of required communication events with different tolerance ranges of 0.5 °C, 1 °C, and
2 °C, respectively. The tolerance ranges determine the need for communication events,
as only deviations larger than the tolerance should be transmitted.

Fig. 5.13: Forecast for one sensor using STL + ARIMA(0,1,0) and LSTM for the 21st of August 2019
with tolerance ranges of ±0.5 °C, ±1 °C, and ±2 °C. ARIMA shows a slightly higher potential in
decreasing communication effort for ±0.5 °C tolerance. ©[2020] IEEE. Reprinted, with permission,
from [30].


ARIMA can achieve a potential reduction of communication events by 87 % for this day
with a tolerance of 0.5 °C, whereas LSTM provides a lower potential of 58 %. However,
for applications where a higher tolerance of 1 °C is sufficient, both algorithms
provide enough accuracy to save up to 100 % of the communication effort.

Fig. 5.14: Walk forward validation with sliding window (constant training period length) in each
algorithm's variant. ARIMA takes t days of training data and produces a one day forecast that is
compared to the test day, LSTM needs input/output pairs for training and is then tested on a held
back test set. ©[2020] IEEE. Reprinted, with permission, from [30].


Statistical Performance Evaluation Both implemented approaches are statistically
evaluated using the Root Mean Square Error (RMSE) as a metric for the modeling errors
and for the potential reduction of communication events. To make both modeling concepts
comparable, a sliding window approach as depicted in Figure 5.14 is used for
validation. Each sliding set from the total dataset has a configurable number of input
days followed by one output day. An advantage of ARIMA is the ability to sufficiently
predict future data relying solely on a small data basis. LSTM, on the other hand,
needs a large number of input/output pairs as training data in order to learn essential
features enabling the sensor data prediction. The impact of varying numbers of input
days from 3 to 12 on the prediction of one forecast day has been addressed in this
work. A 10-fold cross-validation is used to further validate the forecast results, with
a 90-10 % split between training and test data. The resulting modeling error of both
approaches for varying numbers of input days, represented by the RMSE distribution, is
illustrated in Figure 5.15.
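
The sliding-window construction and the error metric can be sketched as follows. The Python
example assumes day-indexed data in which every window consists of `input_days` consecutive
training days followed by one test day; the function names are hypothetical, and the
cross-validation step is not covered.

```python
import math


def sliding_windows(n_days, input_days):
    """Yield (training day indices, test day index) for walk-forward validation.

    The training period has constant length and is shifted by one day per step.
    """
    for start in range(n_days - input_days):
        yield list(range(start, start + input_days)), start + input_days


def rmse(actual, predicted):
    """Root Mean Square Error between two equally long sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


if __name__ == "__main__":
    for train_days, test_day in sliding_windows(n_days=6, input_days=3):
        print(train_days, "->", test_day)
    print(rmse([20.0, 21.0, 22.0], [20.0, 21.0, 24.0]))
```

A longer input period leaves fewer windows for the same dataset, so fewer input/output pairs are
available to a model that has to be trained on them.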

It can be seen that the impact of the input period length is small, but both models
have a slightly lower error with shorter input periods. In general, LSTM yields a mean
error of around 0.3 °C, twice that of ARIMA, which produces a mean error of around
0.15 °C for all input period lengths. LSTM also shows a higher spread of the observed
errors, except for a small number of outliers experienced with ARIMA for longer input
periods. The latter leads to the conclusion that data lying further in the past is less
relevant for predicting the temperature values of the following day, increasing the
number of false forecasts observed. These results allow an estimation of the tolerance
ranges in which the evaluated models can be used. A tolerance range of ±0.5 °C was
therefore chosen as the minimum tolerance for evaluating the potential of the
model-predictive approach. Two supplementary tolerance ranges of ±1 °C and ±2 °C were
evaluated to show the dependence of performance on the chosen tolerance for different
applications. This analysis is carried out for the underlying temperature sensor system
and depicted in Figure 5.16.

Fig. 5.15: RMSE for both approaches and varying training period. ARIMA shows a smaller RMSE
than LSTM, however the mean error for both approaches remains nearly constant. ©[2020] IEEE.
Reprinted, with permission, from [30].

Fig. 5.16: Potential reduction in communication effort provided by using ARIMA and LSTM depending
on varying input periods and tolerance ranges. ©[2020] IEEE. Reprinted, with permission, from [30].

A potential reduction of communication effort by more than 60 % with a tolerance of
±0.5 °C can be observed with both models, with ARIMA reaching more than 80 % potential
reduction. LSTM performance decreases with a larger input period, which can be
explained by a reduced number of input/output pairs available for training. As expected
from the error distribution in Figure 5.15, both models perform almost equally well for
higher tolerance ranges, from more than a 90 % reduction for ±1 °C tolerance to nearly
no communication effort for ±2 °C tolerance.

Finally, both modeling approaches have the potential to cut sensor systems'
communication effort significantly. Due to its superior overall results and a much
higher efficiency in terms of input data needs and computational effort, the ARIMA
algorithm is the favored method.


5.1.4.3 Outlook and Future Work 

Even requirement profiles of current 5G mMTC applications are in a state of constant
evolution. While classic network dimensioning is largely based on stochastic behavior
and correlated traffic volumes, the share of event-driven machine communication will
increase dramatically. The key challenge will be the realization of reliable critical
alarm communication in the face of unpredictable behavior. In this context, a
requirement migration towards mission-critical 6G MTC networks is inevitable [429]. To
ensure that, with a reasonable amount of resources, a guaranteed quality of service can
be achieved for application classes that can barely be predicted, the available
resources must be allocated dynamically and as a function of defined costs (frequency
utilization, delay times, energy consumption, ...). For this purpose, new service
classes need to be defined that specify target latency, Block Error Rate (BLER), and
service characterization, highlighting the need for future 6G systems to leverage
application-domain information about the predictability of resource requirements and
conditions. The new service classes are shown in Figure 5.17.

Fig. 5.17: Requirement migration towards mission-critical 6G MTC networks [429]. Used under CC BY
4.0 (https://creativecommons.org/licenses/by/4.0/).


5.2 Resource-Efficient Vehicle-to-Cloud Communications


Benjamin Sliwa 


Abstract: Big vehicular data is anticipated to become the new fuel for catalyzing the 
further development of connected and autonomous driving. Vehicles themselves will 
act as mobile sensor nodes that actively sense their environment and gather meaningful 
data for novel crowdsensing-enabled services such as the distributed generation of 
high-definition maps, traffic monitoring, and predictive maintenance. However, the 
implied tremendous increase in massive Machine-Type Communication (mMTC) rep- 
resents an enormous challenge for the coexistence of different resource-consuming 
applications and entities within the limited radio spectrum. A promising approach for 
achieving relief through a more resource-efficient usage of existing network resources 
is the utilization of client-based intelligence. Novel communications paradigms such 
as anticipatory mobile networking aim to improve decision processes within wireless 
communication systems by explicitly taking context information into account. In the 
context of vehicular crowdsensing, these methods exploit the delay-tolerant nature of 
the targeted applications for scheduling the data transfer with respect to the expected 
resource efficiency. If the current radio channel and network load conditions do not 
allow a resource-efficient transmission, the data transfer process is postponed and the 
acquired data is aggregated locally in favor of a better transmission opportunity in the 
near future along the expected vehicular trajectory. 


In the following, the different evolution phases of the novel Channel-aware
Transmission (CAT) scheme are presented. These are characterized by a sequential
introduction of different machine learning methods. While the basic CAT approach
applies a probabilistic channel-access mechanism based on measurements of the
Signal-to-Interference-plus-Noise Ratio (SINR), Machine Learning CAT (ML-CAT) applies
supervised learning for predicting the currently achievable data rate using features
from the network context, the mobility context, and the application context domain.


This approach is then further extended by Reinforcement Learning CAT (RL-CAT) 
through the autonomous detection and exploitation of favorable transmission op- 
portunities. Finally, Blackspot-Aware Contextual Bandit (BC-CB) integrates a priori 
knowledge about the geospatially-dependent uncertainties of the prediction model, 
which is uncovered by unsupervised machine learning. 


It is shown that machine learning-aided opportunistic data transfer is not only able to
increase the average data rate of the individual transmissions; it also achieves a
massive reduction of the occupied network resources and the power consumption of the
mobile device. The price to pay is an increase of the Age of Information (AoI) of the
sensor measurements. In addition to the presentation of the novel opportunistic
data-transfer approaches, new machine learning-enabled methods for simulating these
anticipatory mobile networks are presented, discussed, and validated.


[Figure: SINR over time (0 to 1000 s). Connectivity hotspots (reliable and fast data transfer,
good intra-cell coexistence, high resource efficiency) are exploited for transmissions;
connectivity valleys (link loss, packet loss, retransmissions, low resource efficiency) are
avoided. Proposal: reinforcement learning-based data transfer.]

Fig. 5.18: Example for the temporal dynamics of the SINR in vehicular environments.


5.2.1 Introduction 


According to a recent white paper [6] published by the 5G Automotive Association 
(5GAA), predictive Quality of Service (QoS)—e.g. the prediction of the data rate along a 
vehicular trajectory—is expected to become a key enabling method for future connected 
and autonomous driving. 

Although machine learning has already started to penetrate all areas of wireless 
communications [714], the current 5G standardization efforts focus on implementing 
intelligence on the network infrastructure side [1]. However, as discussed in initial 
visionary works [17], it is anticipated that not only the trend of replacing mathematical 
models with machine learning-based equivalents will continue, but also that pervasive 
intelligence will be a key driver for the further cellular network evolution. These develop- 
ments are closely related to the arising anticipatory mobile networking paradigm [107, 
630], which aims to improve decision processes within wireless communication systems 
through explicit consideration of context knowledge and machine learning-based data 
analysis. 

Figure 5.18 shows a real-world trace of the Signal-to-Interference-plus-Noise Ratio
(SINR) acquired along a vehicular trajectory. It can be seen that vehicular
communication channels are characterized by short-term and long-term fluctuations. This
behavior is the result of a superposition of distance variations between sender and
receiver, mobility-related factors, and obstacle-related signal variation due to
shadowing, reflection, and refraction. In order to guarantee a reliable data transfer,
the mobile device reduces the achievable transmission efficiency in favor of better
data integrity during challenging radio channel periods. As a result of the implied
overhead, a large amount of network resources is unavailable for transmitting the
actual payload data.


5.2.2 Related Work 


Opportunistic data transfer implements the idea of postponing the data transfer of
delay-tolerant applications to situations where a higher resource efficiency can be
achieved due to better radio channel conditions. Acquired data is stored in a local
buffer until a favorable transmission opportunity is detected and the whole data buffer
is transferred.

Channel-Aware Transmission (CAT) [296, 298], which represents the foundation 
for the further machine learning-based enhancements presented in this contribution, 
utilizes Signal-to-Interference-plus-Noise Ratio (SINR) measurements for client-based 
opportunistic data transmission based on the known significance of downlink qual- 
ity indicators for assessing the uplink radio channel quality [297]. The probabilistic 
medium access is performed as 


\[
p(t) =
\begin{cases}
0, & \Delta t < \Delta t_{\min} \\
\left( \dfrac{\Phi(t)}{\Phi_{\max}} \right)^{\alpha}, & \Delta t_{\min} \le \Delta t < \Delta t_{\max} \\
1, & \Delta t \ge \Delta t_{\max}
\end{cases}
\tag{5.4}
\]


where $\Phi(t)$ is the channel quality metric (here, the SINR) with maximum value $\Phi_{\max}$,
$\Delta t$ is the elapsed time since the last transmission, $\Delta t_{\min}$
is a minimum inter-packet gap that avoids overly frequent medium access, and
$\Delta t_{\max}$ is an application-specific deadline for the Age of Information (AoI) of the sensor
data packets. The exponent $\alpha$ defines how strongly the transmission
scheme prefers very high metric values within the transmission process.
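
The effect of the exponent can be illustrated with a minimal Python sketch of the piecewise transmission probability. The function name, the clamping of the ratio to [0, 1], and the example timing values are assumptions made for illustration; the 40 dB upper bound follows the SINR value range quoted from [298] in Section 5.2.3.2.

```python
def cat_tx_probability(sinr_db, dt, dt_min, dt_max, alpha, sinr_max_db=40.0):
    """Piecewise transmission probability in the spirit of Equation 5.4.

    sinr_db     : current channel metric (SINR in dB)
    dt          : time since the last transmission in seconds
    dt_min      : minimum inter-packet gap in seconds
    dt_max      : application-specific deadline in seconds
    alpha       : exponent that controls the preference for high metric values
    sinr_max_db : assumed upper bound of the metric
    """
    if dt < dt_min:
        return 0.0          # too early, never transmit
    if dt >= dt_max:
        return 1.0          # deadline reached, always transmit
    # Clamping is an assumption so that out-of-range measurements stay valid.
    ratio = min(max(sinr_db / sinr_max_db, 0.0), 1.0)
    return ratio ** alpha


if __name__ == "__main__":
    # Hypothetical timing values: 10 s gap, 120 s deadline, 30 s since last TX.
    for alpha in (1, 4, 8):
        row = [round(cat_tx_probability(s, 30.0, 10.0, 120.0, alpha), 3)
               for s in (10, 20, 30, 40)]
        print(f"alpha={alpha}: {row}")
```

Larger exponents suppress transmissions at mediocre channel quality, so the buffer is mostly emptied during connectivity peaks or when the deadline expires.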


5.2.3 Machine Learning-Enabled Opportunistic Vehicle-to-Cloud Communication 


Although CAT has been demonstrated to achieve significant benefits in comparison 
with conventional data transmission approaches, recent analyses [638] have shown that 
physical layer indicators such as SINR have only a limited significance for estimating 
the achievable data rate. Since the latter is inversely proportional to the transmission 
duration, it is directly related to the resource occupation time. As a result, the maxi- 
mization of the end-to-end data rate contributes to improving the intra-cell resource 
efficiency. For the exploitation of this property, the novel data transfer schemes build 
upon predictions of the achievable end-to-end data rate. 

The methodological evolution of context-aware data transmission approaches is
summarized in Figure 5.19. The different evolution stages are characterized through
a sequential introduction of solution approaches from different machine learning
disciplines:

Fig. 5.19: Evolution of channel-sensitive solution approaches for resource-efficient vehicle-to-cloud
communication.

Channel-Aware Transmission (CAT) [296, 298] uses a probabilistic medium-access 
approach that takes SINR into account. 

Machine Learning CAT (ML-CAT) [628, 629, 632] utilizes features from the network, 
mobility, and application domains for predicting the currently achievable end-to- 
end data rate, which is then used as the radio channel assessment metric. 

Reinforcement Learning CAT (RL-CAT) [636] replaces the heuristic medium access 
approach with a Q-learning mechanism to autonomously detect and exploit favor- 
able transmission opportunities. 

Black Spot-Aware Contextual Bandit (BS-CB) [622, 625] incorporates a priori knowledge
about the geospatially dependent uncertainties of the prediction model as a
measure of trust in the latter.


In the following, the enabling methods and novel data-transmission schemes are
introduced. Further details and analyses of various parameter variants are provided
in the referenced scientific publications.


5.2.3.1 End-to-End Data-Rate Prediction in Vehicular Networks 

The considered dataset contains context traces in multiple vehicular evaluation scenarios
(campus, urban, suburban, highway). Using the native Android Application
Programming Interface (API), context indicators from different logical domains are
acquired:

- Network context features $X_{\mathrm{net}}$: RSRP, RSRQ, SINR, CQI, TA, carrier frequency
- Mobility context features $X_{\mathrm{mob}}$: velocity, cell ID
- Application context features $X_{\mathrm{app}}$: payload size of the sensor data packet to be
  transmitted

Fig. 5.20: Comparison of the resulting data-rate prediction accuracy achieved by different machine
learning models.


In addition to these passive indicators, the measured data rate of active transmissions
in uplink and downlink direction with a random payload size ranging from 0.1 MB to
10 MB is determined every 10 s. An in-depth analysis of the statistical properties of the
measurements is given in [638].

Using the resulting feature set $X = (X_{\mathrm{net}}, X_{\mathrm{mob}}, X_{\mathrm{app}})$ composed of the individual context vectors,
we trained a machine learning model $f_{\mathrm{ML}}$ on the corresponding
data rate measurements $y$ such that $f_{\mathrm{ML}}: X \rightarrow y$. For this purpose, different regression
models are considered:

Artificial Neural Network (ANN) with sigmoid activation, two hidden layers, ten neurons
per hidden layer, learning rate $\eta = 0.1$, momentum $\alpha = 0.001$, and 500 training
epochs.

M5 Regression Tree (M5)

Random Forest (RF) with 100 trees and a maximum tree depth of 15.

Support Vector Machine (SVM) trained via Sequential Minimal Optimization (SMO)
with Radial Basis Function (RBF) kernel, regularization parameter $C = 1.0$, and
kernel coefficient $\gamma = 1.0$.


The training process is carried out using LIghtweight Machine learning for IoT Systems 
(LIMITS) [633], which provides high-level automation features for the well-known 
Waikato Environment for Knowledge Analysis (WEKA) framework and allows the export 
of C/C++ implementations of the trained prediction models. 

Figure 5.20 shows the Root Mean Square Error (RMSE) of the 10-fold cross-validation
in both transmission directions. It can be seen that there are only minor differences
between the more complex models if they are properly tuned. Even the much simpler M5
model achieves a comparably high prediction accuracy. The RF model achieves the
lowest prediction errors in the uplink direction. Compared with ANNs and SVMs, it
additionally requires significantly less effort for hyperparameter
tuning. As a consequence of these considerations, the further analysis focuses on
utilizing the RF model for performing the data-rate predictions. A scatterplot of the
resulting uplink model is shown in Figure 5.21.

Fig. 5.21: Comparison of RF-based data-rate predictions and corresponding measurements. The
diagonal line corresponds to a hypothetical perfect prediction model.
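
The evaluation metric itself can be sketched in a few lines of Python. The following example does not reproduce the WEKA/LIMITS toolchain or the regression models listed above; it only illustrates how a k-fold cross-validated RMSE is obtained, using synthetic data and two deliberately simple stand-in models (a mean predictor and a one-feature least-squares line). All names and data are hypothetical, and averaging the per-fold RMSE is one common convention among several.

```python
import math
import random


def rmse(y_true, y_pred):
    """Root mean square error between two equally long sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def k_fold_rmse(X, y, fit, k=10, seed=0):
    """Average RMSE over k folds; `fit` returns a predict(x) callable."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    scores = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in idx if i not in held_out]
        predict = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        scores.append(rmse([y[i] for i in test_idx],
                           [predict(X[i]) for i in test_idx]))
    return sum(scores) / k


def fit_mean(X_train, y_train):
    """Baseline: always predict the mean of the training targets."""
    mean = sum(y_train) / len(y_train)
    return lambda x: mean


def fit_linear(X_train, y_train):
    """Least-squares line on the first feature only."""
    xs = [x[0] for x in X_train]
    mx, my = sum(xs) / len(xs), sum(y_train) / len(y_train)
    slope = (sum((a - mx) * (b - my) for a, b in zip(xs, y_train))
             / sum((a - mx) ** 2 for a in xs))
    return lambda x: my + slope * (x[0] - mx)


# Synthetic stand-in data: one feature (e.g. SINR in dB) and a noisy data rate.
_rng = random.Random(1)
X = [[_rng.uniform(0.0, 40.0)] for _ in range(200)]
y = [0.5 * x[0] + _rng.gauss(0.0, 1.0) for x in X]

baseline_rmse = k_fold_rmse(X, y, fit_mean)
linear_rmse = k_fold_rmse(X, y, fit_linear)
print(f"mean predictor: {baseline_rmse:.2f}, linear model: {linear_rmse:.2f}")
```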


5.2.3.2 Machine Learning CAT (ML-CAT) 

The basic idea of ML-CAT is to extend the CAT scheme with a machine learning-based
metric for assessing the radio channel quality. While the latter is represented by the
predicted data rate $\tilde{S}(t) = f_{\mathrm{ML}}(\mathbf{x}(t))$, the value range of the probabilistic transmission
model is implicitly related to the value range of the SINR metric (0 dB to 40 dB according
to [298]). Therefore, a normalization $\Theta(t)$ based on the value range $[\Phi_{\min}, \Phi_{\max}]$ of the
transmission metric $\Phi(t)$ is defined as

\[
\Theta(t) = \frac{\Phi(t) - \Phi_{\min}}{\Phi_{\max} - \Phi_{\min}}
\tag{5.5}
\]

The transmission probability $p(t)$ is then computed in analogy to Equation 5.4:

\[
p(t) =
\begin{cases}
0, & \Delta t < \Delta t_{\min} \\
\Theta(t)^{\alpha}, & \Delta t_{\min} \le \Delta t < \Delta t_{\max} \\
1, & \Delta t \ge \Delta t_{\max}
\end{cases}
\tag{5.6}
\]
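
A possible realization of the normalization and the resulting probabilistic transmission decision is sketched below. The function names, the clamping of the normalized metric, and the injectable random number generator are illustrative assumptions rather than part of the published scheme.

```python
import random


def normalize_metric(phi, phi_min, phi_max):
    """Map a metric value onto [0, 1] as in Equation 5.5 (clamping is an assumption)."""
    theta = (phi - phi_min) / (phi_max - phi_min)
    return min(max(theta, 0.0), 1.0)


def ml_cat_should_transmit(predicted_rate, dt, dt_min, dt_max, alpha,
                           rate_min, rate_max, rng=random):
    """Draw a transmission decision from the probability of Equation 5.6.

    predicted_rate     : output of the data-rate prediction model
    rate_min, rate_max : assumed value range of the predicted data rate
    rng                : any object with a random() method returning [0, 1)
    """
    if dt < dt_min:
        return False
    if dt >= dt_max:
        return True
    p = normalize_metric(predicted_rate, rate_min, rate_max) ** alpha
    return rng.random() < p


if __name__ == "__main__":
    rng = random.Random(42)
    # Hypothetical values: 12 Mbit/s predicted within an assumed 0..30 Mbit/s range.
    decisions = [ml_cat_should_transmit(12.0, 30.0, 10.0, 120.0, 4, 0.0, 30.0, rng)
                 for _ in range(1000)]
    print("empirical TX share:", sum(decisions) / len(decisions))
```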


5.2.3.3 Reinforcement Learning CAT (RL-CAT) 

With RL-CAT, the previously probabilistic medium access is replaced by a reinforcement
learning approach. A schematic illustration of the interactions between the different
logical entities is shown in Figure 5.22. The model consists of three core components:

- The actual opportunistic data transfer is realized as an agent that learns to perform
  the possible actions (local buffering in expectation of future improvements or
  transmission of the whole sensor data buffer) through observation of the resulting
  rewards.

- The environment is represented by the cellular network. In contrast to conventional
  reinforcement learning, which assumes that the actions taken by the agent
  have a significant impact on the state of the latter within the environment, external
  impact factors have a dominant influence on the end-to-end behavior.

- Sensing is performed using the actual hardware platform based on the measurable
  context indicators. The raw measurements are brought together using an RF-based
  data-rate prediction model.

Fig. 5.22: Interaction between the logical entities for reinforcement learning-enabled opportunistic
data transfer.


The reinforcement learning-based action selection process utilizes a decision table $Q$
for assessing the expected rewards of the possible actions $a_{\mathrm{IDLE}}$ and $a_{\mathrm{TX}}$ based on a
given state represented by the context tuple $c_t = (\tilde{S}(t), \Delta t)$. Based on the available
measurements, the action to be executed is determined as $a = \arg\max_{a'} Q(c_t, a')$. The
classical Q-learning update process can be formulated as

\[
Q(c_t, a) = (1 - \alpha) \cdot Q(c_t, a) + \alpha \left[ r_a + \lambda \cdot \max_{a'} Q(c_{t+1}, a') \right]
\tag{5.7}
\]


where $\alpha$ represents the learning rate, $\lambda$ is the discount factor, and $r_a$ is the reward
of the taken action $a$. However, as pointed out earlier, the agent's impact on its own
state can be regarded as negligible: even if the agent were capable of performing "optimal"
actions, the achievable end-to-end performance would still be impacted by
non-controllable factors such as the network quality and the traffic load caused by
other users. Therefore, a myopic approach that focuses on optimizing the immediate
reward of the taken actions is implemented by setting $\lambda = 0$, which results in the
simplified formula

\[
Q(c_t, a) = (1 - \alpha) \cdot Q(c_t, a) + \alpha \cdot r_a.
\tag{5.8}
\]
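
The myopic update can be sketched as a small table-based learner in Python. Because a decision table needs discrete states, the sketch bins the context tuple into a grid; the bin widths, the class and function names, and the purely greedy action selection without exploration are assumptions made for illustration.

```python
from collections import defaultdict

ACTIONS = ("IDLE", "TX")


def discretize(predicted_rate_mbps, dt_s, rate_step=5.0, dt_step=10.0):
    """Map the continuous context (predicted rate, buffering time) to a table index.

    The bin widths are hypothetical; any state quantization could be used.
    """
    return (int(predicted_rate_mbps // rate_step), int(dt_s // dt_step))


class MyopicQTable:
    """Decision table updated with the discount factor set to zero (Equation 5.8)."""

    def __init__(self, learning_rate=0.1):
        self.alpha = learning_rate
        self.q = defaultdict(float)  # (state, action) -> expected immediate reward

    def select(self, state):
        """Greedy action selection; ties resolve to the first action (IDLE)."""
        return max(ACTIONS, key=lambda a: self.q[(state, a)])

    def update(self, state, action, reward):
        """Q <- (1 - alpha) * Q + alpha * reward, without a look-ahead term."""
        key = (state, action)
        self.q[key] = (1.0 - self.alpha) * self.q[key] + self.alpha * reward


if __name__ == "__main__":
    table = MyopicQTable(learning_rate=0.5)
    state = discretize(22.0, 35.0)
    table.update(state, "TX", 1.0)
    table.update(state, "TX", 1.0)
    print(state, table.q[(state, "TX")], table.select(state))
```

With the discount factor at zero, each table entry converges towards the average immediate reward observed for that state-action pair, which matches the assumption that the agent cannot influence its future channel conditions.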


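The action-specific reward functions are defined as

\[
r_{\mathrm{TX}}(\tilde{S}, \Delta t) =
\underbrace{w \cdot \left( \frac{\tilde{S} - S^{*}}{S_{\max}} \right)}_{\text{Data rate optimization}}
+ \underbrace{(1 - w) \cdot \frac{\Delta t}{\Delta t_{\max}}}_{\text{AoI optimization}}
\tag{5.9}
\]

and

\[
r_{\mathrm{IDLE}}(\Delta t) =
\begin{cases}
-\Omega, & \Delta t > \Delta t_{\max} \\
0, & \text{else}
\end{cases}
\tag{5.10}
\]

Here, the parameter $w$ controls the fundamental trade-off between data-rate
optimization and AoI reduction, $S^{*}$ is the target data rate, and $S_{\max}$ represents the
upper data rate bound of the empirical measurements. Although there is no immediate
reward if no data transfer is initiated, $\Omega$ serves as a punishment factor if the buffering
time $\Delta t$ exceeds the application-specific deadline $\Delta t_{\max}$.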

Instead of performing a large number of real-world transmissions for training
the reinforcement learning mechanisms, a Data-Driven Network Simulation (DDNS)
setup is implemented according to [637]. In contrast to classical system-level network
simulation, which requires a large number of assumptions and simplifications for
setting up virtual representations of concrete real-world scenarios, DDNS makes use of
a combination of machine learning models and empirical context traces. This black
box approach does not require us to explicitly model communicating entities and
achieves not only a close-to-reality representation of real-world behavior but also a
high computational efficiency.


5.2.3.4 Black Spot-Aware Contextual Bandit (BS-CB) 

While BS-CB builds upon the reinforcement learning-based medium access approach
of RL-CAT, it introduces additional mechanisms for assessing trust in the data-rate
predictions. Moreover, it replaces the Q-learning component by a contextual bandit
reinforcement model. A detailed description of BS-CB is given in [622].

In order to improve the data-rate prediction accuracy, the concept of black spot regions
is introduced. Within those areas, the properties of the geographical environment
lead to a significant increase in the location-specific prediction error (e.g., related to an
increased handover probability). If knowledge about the presence of those black
spots is available, transmissions can be postponed in order to avoid severe mismatches
of predictions and measurements. For this purpose, BS-CB leverages a priori data about
previous measurements in the targeted scenarios. Based on k-means-enabled unsupervised
learning, the data is clustered into $N_c$ clusters. For each cluster, the clusterwise
RMSE is computed and compared with a given cluster threshold $\mathrm{RMSE}_{\max}$. All clusters
that exceed the defined threshold are treated as black spot clusters and fitted to ellipses.
During the application phase of the transmission mechanism, the vehicle performs
an ellipse test to check if it is currently within a black spot region. If the condition is
fulfilled, the transmission process is postponed.
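
The two building blocks after the clustering step, the clusterwise RMSE thresholding and the ellipse test, can be sketched as follows. The sketch assumes that cluster labels (e.g. from k-means) and the fitted ellipse parameters are already available; how the ellipses are fitted is not specified here, and all function names and the local planar coordinate system are hypothetical.

```python
import math
from collections import defaultdict


def clusterwise_rmse(cluster_ids, predicted, measured):
    """RMSE of the data-rate prediction, computed separately for each cluster."""
    squared = defaultdict(list)
    for cid, p, m in zip(cluster_ids, predicted, measured):
        squared[cid].append((p - m) ** 2)
    return {cid: math.sqrt(sum(v) / len(v)) for cid, v in squared.items()}


def black_spot_clusters(rmse_by_cluster, rmse_max):
    """Clusters whose RMSE exceeds the threshold are treated as black spots."""
    return {cid for cid, err in rmse_by_cluster.items() if err > rmse_max}


def inside_ellipse(x, y, cx, cy, semi_major, semi_minor, rotation_rad=0.0):
    """Test whether position (x, y) lies within a rotated ellipse.

    Coordinates are assumed to be planar (e.g. metres in a local frame).
    """
    dx, dy = x - cx, y - cy
    cos_r, sin_r = math.cos(rotation_rad), math.sin(rotation_rad)
    # Rotate the offset into the ellipse's own axis-aligned frame.
    u = dx * cos_r + dy * sin_r
    v = -dx * sin_r + dy * cos_r
    return (u / semi_major) ** 2 + (v / semi_minor) ** 2 <= 1.0


if __name__ == "__main__":
    errors = clusterwise_rmse([0, 0, 1, 1], [10, 12, 10, 10], [10, 10, 4, 16])
    print(errors, black_spot_clusters(errors, rmse_max=3.0))
    print(inside_ellipse(1.0, 0.0, 0.0, 0.0, 2.0, 1.0))
```

During operation, a vehicle would run the ellipse test against every stored black spot ellipse and postpone the transmission if any test succeeds.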

The actual reinforcement learning process is modeled as a contextual bandit that
proposes an action $a$ (either IDLE for further local buffering or TX for data transmission)
using

\[
a = \arg\max_{a \in \mathcal{A}} \;
\underbrace{\hat{\theta}_a^{\top} c}_{\text{Estimated reward}}
+ \underbrace{\alpha \sqrt{c^{\top} \mathbf{A}_a^{-1} c}}_{\text{CB}}
\tag{5.11}
\]

where $\hat{\theta}_a$ corresponds to the ridge regression coefficients of action $a$, and $c = (\tilde{S}(t), \Delta t)$
is the $d$-dimensional context tuple for the predicted data rate $\tilde{S}(t)$ and the current
buffering delay $\Delta t$.
The degree of exploration is controlled using the greediness parameter $\delta$:

\[
\alpha = 1 + \sqrt{\frac{\ln(2/\delta)}{2}}
\tag{5.12}
\]


After either the IDLE or the TX action has been performed, the regression coefficients are
updated as

\[
\hat{\theta}_a \leftarrow \mathbf{A}_a^{-1} b_a
\tag{5.13}
\]

with

\[
b_a \leftarrow b_a + r_a \cdot c.
\tag{5.14}
\]


For determining the actual rewards of the chosen actions, the reward functions of 
RL-CAT are re-utilized according to Equation 5.9 and Equation 5.10. 
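
The action selection and coefficient update can be sketched for the two-dimensional context as follows. This is a minimal illustration in the style of the standard LinUCB formulation, not the BS-CB implementation: the identity initialization of the per-action matrix, its rank-one update, the exploration weight as a function of the greediness parameter, and the normalization of both context components to [0, 1] are assumptions taken from that standard formulation, and all names are hypothetical.

```python
import math


def inv2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def matvec(m, v):
    return [m[0][0] * v[0] + m[0][1] * v[1], m[1][0] * v[0] + m[1][1] * v[1]]


def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]


class LinUcbArm:
    """Ridge regression state of one action (IDLE or TX) for a 2-d context."""

    def __init__(self):
        self.A = [[1.0, 0.0], [0.0, 1.0]]  # assumed identity initialization
        self.b = [0.0, 0.0]

    def theta(self):
        """Regression coefficients, theta = A^-1 b."""
        return matvec(inv2(self.A), self.b)

    def score(self, c, alpha):
        """Estimated reward plus confidence bound for context c."""
        a_inv = inv2(self.A)
        estimate = dot(matvec(a_inv, self.b), c)
        bound = alpha * math.sqrt(dot(c, matvec(a_inv, c)))
        return estimate + bound

    def update(self, c, reward):
        """b <- b + r * c, and the assumed standard update A <- A + c c^T."""
        for i in range(2):
            for j in range(2):
                self.A[i][j] += c[i] * c[j]
            self.b[i] += reward * c[i]


def exploration_weight(delta):
    """Exploration weight of the standard LinUCB formulation (assumed here)."""
    return 1.0 + math.sqrt(math.log(2.0 / delta) / 2.0)


def select_action(arms, c, alpha):
    return max(arms, key=lambda name: arms[name].score(c, alpha))


if __name__ == "__main__":
    arms = {"IDLE": LinUcbArm(), "TX": LinUcbArm()}
    context = [0.8, 0.5]  # normalized predicted data rate and buffering delay
    arms["TX"].update(context, reward=1.0)
    print(select_action(arms, context, alpha=1.0))
    print(select_action(arms, context, alpha=exploration_weight(0.1)))
```

A larger exploration weight favors the action with the wider confidence bound, so rarely tried actions are revisited even when their estimated reward is lower.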


5.2.4 Results of the Real-World Performance Comparison 


For the performance evaluation of the novel machine learning-enabled methods, a 
25 km long evaluation track with varying environmental characteristics, speed limita- 
tions, and building densities is considered. For each transmission scheme, ten real- 
world drive tests are performed. Hereby, a virtual sensor application generates 50 kB of 
sensor data per second, which is buffered locally until a transmission decision is taken 
for the whole buffer. 

Figure 5.23 shows the achieved end-to-end data rate of the transmission schemes. 
While the basic channel-sensitive approach of CAT already achieves a significant
improvement of the data rate, the latter increases substantially through the introduction
of machine learning-based channel assessment. Moreover, the reinforcement
learning-based data transfer results in additional gains. In comparison to conventional
fixed interval data transmission, BS-CB achieves performance improvements of 195 %
in uplink and 223 % in downlink direction.

Fig. 5.23: End-to-end data rate of the different transmission schemes.


As shown in Figure 5.24, the apparently selfish goal of data rate maximization con- 
tributes to improving the good of all: all opportunistic data transfer methods are able 
to achieve a significantly better resource efficiency than the conventional approach. 
Although the methodological evolution is also represented in the achieved results, there 
are only minor differences between the machine learning-enabled methods. Here, BS- 
CB reduces the number of occupied cell resources by around 85 % in both transmission 
directions. 


5.2.5 Outlook and Future Work 


Due to its enabling character for all presented transmission schemes, future work 
should focus on optimizing the accuracy of the data-rate prediction model. A major 
limitation of client-based prediction approaches is their limited insight into the current 
traffic load within the cell. Future networks could compensate for this limitation through
active announcement of network infrastructure knowledge about the traffic load, e.g.,
acquired through novel 5G mechanisms such as the Network Data Analytics Function
(NWDAF). As shown by a recent feasibility study [626], the integration of available 
network knowledge reduces the RMSE by 25 % in the uplink and 30 % in the downlink 
direction.

Fig. 5.24: Resource efficiency of the different transmission schemes.

Abstract: The analysis of mobile network data is a fundamental requirement for the
development and invention of novel networking approaches that fulfill the rapidly 
growing requirements and demands on those networks. This process requires the iden- 
tification and a thorough investigation of shortcomings in existing field deployments, 
independent of the network operators and/or the network equipment vendors. De- 
spite public standardization by the 3rd Generation Partnership Project (3GPP), cellular 
networks are developed and operated as closed systems that provide a predefined net- 
working service to the subscriber while disclosing only a minimum of system-related 
information such as signal strength or quality in the User Equipment (UE). 


However, researchers often require a deeper insight into network functionality, especially
when it comes to considering network load and occasional congestion while
still maintaining the privacy of the regular network users. With this knowledge, future
devices may predict their achievable throughput passively under the current load and
channel conditions without the need to trigger a transmission just for the sake of
throughput measurements that in turn induce (unnecessary) network load. They may
leverage this predicted information for, e.g., load balancing, network selection, or service
adaptation. Since cellular networks are centrally governed by the base stations, which 
assign the spectral resources by explicit and fine-grained signaling to each active device 
in the coverage area, information about the cell-wide resource utilization is already “in 
the air”. For performance reasons and unlike the ciphered payload exchange between 
UE and the base station, the control messages that carry the resource assignments are 
not encrypted. 


However, these messages are scrambled by a device- and session-specific Radio 
Network Temporary Identifier (RNTI), which is essential for the proper interpretation 
and validation of those messages and which is exchanged only once at the beginning 
of each session. This section describes the achievements of the CRC 876 in extracting 
these control messages of new and already active sessions efficiently and reliably over 
the air and without the need for expensive specialized hardware. The methodology 
of the underlying control channel analysis is embedded into a comprehensive open- 
source software framework Fast Analysis of LTE Control Channels (FALCON), which 
uses Software-Defined Radios (SDRs) to capture the base station’s signal and accurately 
extracts the control messages in real time on a regular computer system.

In subsequent case studies, supervised learning leverages the disclosed
network-load information from short-term observations to predict the expected
data rate, which ultimately serves as a metric for dynamic network selection to
achieve the highest throughput over the fastest network connection currently available.


5.3.1 Introduction 


The steady increase in data traffic in mobile networks, triggered by the rapidly growing
number of human and machine network subscribers, poses a challenge to both network
operators and the services that depend on them, because the limited radio spectrum
has to meet simultaneously growing demands on quality of service. Achievable
data rates depend on the one hand on the cell bandwidth and signal quality, and on 
the other hand on the activity of other cell users competing simultaneously for the 
available radio resources. One of the possible strategies is to use higher frequency 
ranges, in which higher bandwidths and thus more spectral resources are available for 
transmission. However, due to the inherently higher signal attenuation, these frequency 
ranges are only suitable for covering smaller areas, so that region-wide coverage is only 
economical with a correspondingly high user density. In order to meet the growing 
demands in the remaining areas and to counteract bottlenecks, the usage efficiency of 
the available resources must therefore be further increased. For example, subscribers 
could switch to less busy networks or perform delay-tolerant data transmissions only 
when channel and load situations are favorable. 

However, both the research and the application of such mitigation strategies require 
the accurate measurement of both signal quality and instantaneous network load in 
order to identify overload situations without creating unnecessary load themselves, e.g. 
in the form of test transmissions. Even though mobile devices measure signal strength 
and quality autonomously, present mobile networks allow users and external observers 
only a very limited insight into the momentary resource utilization of the cell. Although 
the total occupancy of radio resources can be determined by spectrum analysis (cf. 
top row in Figure 5.25), the actual degree of contention in the case of full occupancy 
remains concealed since the number of served subscribers cannot be identified in the 
spectrum (cf. last two columns in Figure 5.25). 

In 4G and 5G networks, the distribution of spectral resources is governed by the 
base station, which explicitly allocates its resources to single active subscribers via 
special control channels. For efficiency reasons, these allocation messages are not 
encrypted, but reliable decoding requires knowledge of the addressee’s RNTI, since the 
attached checksum is scrambled with it. As a result, UEs can only read the assignments 
that affect them. Inactive users without assigned RNTI and external observers can 
decode only the assignments for specially reserved RNTIs that concern general system 
information or paging.

Fig. 5.25: If a new user starts a transmission at $t_0$, the number of allocated spectral resources
(shaded in blue) depends on the number and activity of other participants (bottom row, other colors).
In the spectrum analysis (upper row), the allocations of individual users are indistinguishable, and
the number of participants remains unknown. ©[2016] IEEE. Reprinted, with permission, from [195].


In the course of the CRC 876 research, efficient control channel analysis methods 
for finding valid RNTIs have been developed and evaluated, enabling passive load 
sensing of the mobile network and thus providing valuable information for a client- 
side data-rate prediction. The presented approaches are directly applicable to public 
4G cellular networks and enable real-time discovery of all resource allocations using 
off-the-shelf PCs and SDRs. A comprehensive reference implementation is provided in 
the form of the open-source framework FALCON. Using the collected data and derived 
features to characterize the network load, supervised learning is used to train prediction 
models that enable data-rate prediction whose accuracy significantly exceeds previous 
approaches based purely on signal strength. Applied simultaneously to multiple cellular 
networks, the prediction enables the UE to perform predictive network selection in order 
to transmit data over the network with the most promising data rate, especially during 
high-load periods. The prediction accuracy, achievable data rate gain, and impact on 
UE energy consumption are evaluated using case study data collected in public mobile 
networks. 

The remainder of this section is structured as follows: Section 5.3.2 presents related work
in the area of control channel analysis. Subsequently, Section 5.3.3 discusses methods
for analyzing control channels and presents some further implications that can be
derived from observing the cell activity. Key findings are summarized and a conclusion
is drawn in Section 5.3.5.


5.3.2 Related Work


The assessment of the current mobile network connection, especially in terms of link 
throughput, is commonly done by means of active probing [289]. This means that a 
data transmission is triggered to measure the throughput currently available for the 
device under test. However, this approach consumes network and radio resources, a cost
that a purely passive measurement or prediction avoids.

The device itself has only a limited view on the current network load, yet it provides 
performance indicators such as Reference Signal Received Power (RSRP) and Reference 
Signal Received Quality (RSRQ), which can be used only for a rough forecast of the 
achievable data rate of subsequent transmissions [297]. The authors of [392] additionally
utilize details from lower protocol layers and the chipset.

More promising solution approaches need to consider further information that is
usually outside the scope of the mobile device. For this purpose, expensive commercial
tools with special hardware requirements [695] as well as off-the-shelf SDRs and open-source
protocol stacks allow tailored solutions based on deep insights into the signaling
protocol behavior and related routines. Among the SDR-based approaches,
LTEye [359] and Online Watcher for LTE (OWL) [108] analyze
the control channel for resource allocations to infer the current resource utilization
and the concurrently active users. As will be detailed in the next section, LTEye suffers from
numerous false-positive detections, while OWL constitutes a solid, real-time capable
approach that, however, detects only newly joining devices. In contrast, our approach
FALCON [196] implements improved detection capabilities of the resource utilization
in mobile networks and is even able to forecast or recommend the most performant
network at a given time.


5.3.3 Control Channel Analysis 


In current mobile networks, radio resources are divided by time and frequency in the 
manner of a two-dimensional resource grid. The resource grid spans the cell bandwidth 
in frequency domain and the time domain is divided into a nested and periodic struc- 
ture of symbols, slots, subframes, and frames. The smallest resource unit in 4G and 
5G networks is the Resource Element (RE), which corresponds to a single subcarrier of 
an Orthogonal Frequency Division Multiplexing (OFDM) symbol. According to a prede- 
fined pattern, some REs are used to broadcast synchronization sequences or provide 
reference levels for equalization. REs without a special purpose serve as resources for 
the transmission of any other data, including control and payload messages. These 
spare REs are grouped into equal-sized Resource Blocks (RBs), which are the smallest 
unit of resources that can be allocated to individual UEs. In common 4G networks, an RB
spans 12 subcarriers in the frequency domain and 7 symbols in time, which corresponds to
a bandwidth of 180 kHz and a duration of 0.5 ms. However, resource allocations always
apply to both slots (0.5 ms each) of a subframe (1 ms).
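
The quoted numbers follow directly from the LTE grid dimensions, as the short calculation below shows. The 15 kHz subcarrier spacing is the standard LTE value implied by 180 kHz / 12; the constant and function names are chosen for illustration only.

```python
SUBCARRIER_SPACING_HZ = 15_000   # standard LTE subcarrier spacing
SUBCARRIERS_PER_RB = 12
SYMBOLS_PER_SLOT = 7             # normal cyclic prefix
SLOT_DURATION_MS = 0.5
SLOTS_PER_SUBFRAME = 2


def rb_bandwidth_hz():
    """Bandwidth covered by one resource block."""
    return SUBCARRIERS_PER_RB * SUBCARRIER_SPACING_HZ


def res_per_rb():
    """Resource elements within one resource block (one slot)."""
    return SUBCARRIERS_PER_RB * SYMBOLS_PER_SLOT


def res_per_allocation(n_rb):
    """Resource elements covered when n_rb blocks are allocated for a full subframe."""
    return n_rb * res_per_rb() * SLOTS_PER_SUBFRAME


if __name__ == "__main__":
    print(rb_bandwidth_hz(), "Hz per RB")
    print(res_per_rb(), "REs per RB,", SLOT_DURATION_MS * SLOTS_PER_SUBFRAME, "ms per subframe")
    print(res_per_allocation(10), "REs for a 10-RB allocation")
```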

The allocation of RBs is organized centrally by the base station and signaled to the 
UEs via a special control channel, namely Physical Downlink Control Channel (PDCCH), 
which is located in the first 1, 2, or 3 symbols of every subframe and spans the entire cell 
bandwidth. It includes assignments for transmissions in both directions, the downlink 
(i.e. from the base station to the UE) and the uplink (i.e. from the UE to the base station). 
These apply in the downlink (DL) direction for the current subframe or in the uplink (UL) 
direction 4 subframes later to give the UE enough time to prepare. 

From a logical point of view, the PDCCH consists of a sequence of Control Channel 
Elements (CCEs), each comprising 36 REs, whose total number is calculated from the 
cell bandwidth and the number of occupied OFDM symbols. These CCEs carry the 
encoded Downlink Control Information (DCI) for single UEs, which contain the RB 
allocation, the Modulation and Coding Scheme (MCS), the power control commands, 
and further control information required for decoding or encoding the payload in the 
allocated resources. Base stations use rate-1/3 channel coding, interleaving, and rate
matching for each emitted DCI data structure to provide Forward Error Correction (FEC) and to fit the encoded
sequence into $L \in \{1, 2, 4, 8\}$ consecutive CCEs. The aggregation level $L$ is selected
by the base station according to the channel conditions of the addressee to ensure
proper reception. Any additional or spare space within the $L$ CCEs is filled by cyclic
repetitions of the encoded sequence, and interleaving ensures an even distribution of
repeated bits.
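
The cyclic filling of the CCEs can be illustrated with a strongly simplified sketch. It assumes 72 coded bits per CCE (36 REs with QPSK, i.e. 2 bits per RE) and omits the interleaving stage entirely; the function name is hypothetical, and sequences longer than the available space are simply truncated here.

```python
BITS_PER_CCE = 72  # 36 REs x 2 bits per RE (QPSK)


def fill_cces(encoded_bits, aggregation_level):
    """Fill L consecutive CCEs by cyclically repeating the encoded sequence.

    Simplification: no sub-block interleaving, only the circular read-out.
    """
    if aggregation_level not in (1, 2, 4, 8):
        raise ValueError("aggregation level must be 1, 2, 4, or 8")
    capacity = aggregation_level * BITS_PER_CCE
    return [encoded_bits[i % len(encoded_bits)] for i in range(capacity)]


if __name__ == "__main__":
    # Hypothetical encoded sequence of 100 bits.
    sequence = [(i * 7 + 3) % 2 for i in range(100)]
    filled = fill_cces(sequence, aggregation_level=2)
    print(len(filled), "bits transmitted,", len(filled) - len(sequence), "of them repeated")
```

Because the same circular read-out is used for every DCI format, the length of the original sequence cannot be read off the transmitted bits, which is why receivers have to try several format hypotheses.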

Prior to the encoding, each DCI is appended with its 16-bit Cyclic Redundancy 
Check (CRC) checksum, which is additionally scrambled (via binary XOR) with the 
RNTI of the addressee. Conversely, receiving UEs only consider decoded DCI where the 
CRC matches their current RNTI.1 
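
The scrambling step and its consequence for an external observer can be sketched as follows. The sketch uses the 16-bit CRC with generator polynomial 0x1021 and zero initial value, but processes whole bytes, whereas real DCI payloads are bit strings of format-dependent length; the function names and the example payload are hypothetical.

```python
def crc16(data: bytes) -> int:
    """Bitwise CRC-16 with polynomial 0x1021, initial value 0, MSB first."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def scrambled_crc(dci_payload: bytes, rnti: int) -> int:
    """Checksum as transmitted by the base station: CRC XOR RNTI."""
    return crc16(dci_payload) ^ (rnti & 0xFFFF)


def ue_accepts(dci_payload: bytes, received_crc: int, own_rnti: int) -> bool:
    """A UE accepts a decoded DCI only if unscrambling with its RNTI matches."""
    return crc16(dci_payload) == (received_crc ^ own_rnti)


def candidate_rnti(dci_payload: bytes, received_crc: int) -> int:
    """An observer without prior knowledge can only derive a candidate RNTI."""
    return crc16(dci_payload) ^ received_crc


if __name__ == "__main__":
    payload = bytes([0x5A, 0x3C, 0x01, 0xF0])
    tx_crc = scrambled_crc(payload, rnti=0x4A1B)
    print(ue_accepts(payload, tx_crc, own_rnti=0x4A1B))
    print(hex(candidate_rnti(payload, tx_crc)))
```

Note that `candidate_rnti` returns some 16-bit value for every input, including wrongly decoded payloads. The checksum alone therefore cannot tell an observer whether a candidate is valid.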

Since the PDCCH has no table of contents, only blind decoding of the CCEs can 
determine whether relevant information is present. This also includes all possible 
combinations resulting from different L. To reduce the number of decoding attempts 
for a UE, the standard defines a search space function that restricts the search space to 
a maximum of 22 evenly distributed locations according to RNTI, subframe, and L. 

Furthermore, the standard defines numerous DCI formats for different transmission 
modes, which depend on the number of antennas used and the capabilities of the UE 
and the evolved NodeB (eNodeB) as 4G base station. Transmission modes are negotiated 
both when connections are established and dynamically depending on the channel 
conditions. The DCI formats differ in their length and consequently in the length of the 
encoded sequence. However, the same circular approach is used for rate matching, i.e.
to populate the CCEs, so that the initial format is no longer apparent in the transmitted
sequence. End devices therefore decode the sequences multiple times, assuming any
DCI formats (or their lengths) that comply with the specified transmission mode.

1 The UE also tracks special reserved RNTIs for system information and paging as required.

Consequently, an external observer faces the following challenges when decoding
the entire content of the PDCCH:

1. Numerous decoding attempts covering all locations of the PDCCH, including all possible
combinations of DCI format length and aggregation level, are computationally
expensive and result in many invalid DCI candidates.

2. All DCI candidates whose CRC does not match a valid RNTI must be discarded.

3. Valid RNTIs must be found within the set of decoded DCI candidates.


The chicken-or-egg situation resulting from the last two points can be resolved in several 
ways: LTEye [359] re-encodes each decoded DCI candidate, compares the encoded 
sequence with the received bits on the channel, and discards any candidates that 
deviate by a certain degree. But in the presence of noise or interference, we show that 
this approach is highly inaccurate and leads to numerous false decisions. In a more 
robust approach, [108] follows the initial connection establishment of joining UEs, 
which contains the RNTI assignment in plain text, and builds up a list of valid RNTIs. 
However, RNTIs of UEs that entered the cell before the monitoring remain undetected. 
Therefore, OWL follows the approach of [359] as a fallback. By contrast, the authors of 
UnCover Information in Mobile Access Networks (U-CIMAN) [771] propose to first accept 
any DCI candidate and to decode the potential payload in the allocated RBs. If this 
attempt fails due to an invalid CRC of the payload, the DCI is discarded. The approach 
involves a significant computational cost due to the larger amount of data and the more 
complex decoding procedure for the payload data. 

In this area, CRC 876 has made substantial contributions to a resource-efficient yet 
reliable control channel analysis that is especially suitable for short-term monitoring in 
order to estimate the total cell load. In [195] we propose a histogram-based approach in 
conjunction with an inverse application of the search space function to identify valid 
RNTIs and decode the corresponding DCI candidates. First, DCI candidates decoded 
from all possible locations, formats, and aggregation levels are validated with respect 
to their permitted positions, as the eNodeB never places DCI outside their associated 
search space. This approach reliably filters 80-90 % of all candidates, including invalid 
DCI. 
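
The position check can be sketched as follows: for the candidate RNTI derived from a decoded DCI, the permitted starting CCEs are computed and compared with the location where the candidate was actually found. The constants and the hashing recursion follow the UE-specific search space definition in 3GPP TS 36.213 as far as recalled here and should be verified against the specification; the common search space and carrier aggregation are ignored, and the function names are hypothetical.

```python
A = 39827
D = 65537
# Number of UE-specific PDCCH candidates per aggregation level.
UE_CANDIDATES = {1: 6, 2: 6, 4: 2, 8: 2}


def ue_search_space_starts(rnti, subframe, level, n_cce):
    """Starting CCE indices of the UE-specific search space.

    rnti     : non-zero 16-bit identifier, used as seed of the recursion
    subframe : subframe index within the radio frame (0..9)
    level    : aggregation level L
    n_cce    : total number of CCEs in the control region of this subframe
    """
    y = rnti
    for _ in range(subframe + 1):
        y = (A * y) % D          # Y_k = (A * Y_{k-1}) mod D
    slots = n_cce // level
    if slots == 0:
        return []
    return sorted({level * ((y + m) % slots) for m in range(UE_CANDIDATES[level])})


def is_permitted(rnti, subframe, level, start_cce, n_cce):
    """Inverse use: does a decoded candidate sit inside its RNTI's search space?"""
    return start_cce in ue_search_space_starts(rnti, subframe, level, n_cce)


if __name__ == "__main__":
    starts = ue_search_space_starts(0x1234, subframe=3, level=4, n_cce=84)
    print("permitted start CCEs:", starts)
    print(is_permitted(0x1234, 3, 4, starts[0], 84))
```

Since a wrongly decoded candidate yields an essentially random RNTI, its location rarely coincides with that RNTI's few permitted positions, which explains the high rejection rate of this stage.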

The next filter stage first collects the RNTIs of all DCI candidates in a history and
maintains a histogram of their occurrences. All candidates whose RNTI count in the
histogram does not exceed a threshold value k are discarded. This filter is based on the
fact that active participants receive multiple assignments within a short period of time,
so their RNTIs occur more frequently than the random RNTIs that result from decoding with
incorrect parameters. An example is given in Figure 5.26.
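
The histogram stage can be illustrated with a small sliding-window counter. This is a minimal Python sketch under stated assumptions: the class name, the history length of 200 entries, and the threshold of 5 are hypothetical values chosen for illustration and do not reflect the parameters used in [195].

```python
from collections import Counter, deque


class RntiHistogram:
    """Sliding history of candidate RNTIs with an occurrence threshold."""

    def __init__(self, history_length: int, threshold: int):
        self.history = deque()
        self.history_length = history_length
        self.threshold = threshold
        self.counts = Counter()

    def add(self, rnti: int) -> None:
        """Record one decoded candidate and drop the oldest entry if needed."""
        self.history.append(rnti)
        self.counts[rnti] += 1
        if len(self.history) > self.history_length:
            oldest = self.history.popleft()
            self.counts[oldest] -= 1
            if self.counts[oldest] == 0:
                del self.counts[oldest]

    def is_active(self, rnti: int) -> bool:
        """An RNTI is accepted only if its count exceeds the threshold k."""
        return self.counts[rnti] > self.threshold


# Illustrative input: one frequently seen RNTI and three random ones.
hist = RntiHistogram(history_length=200, threshold=5)
for rnti in [0x4A21] * 8 + [0x0F03, 0xB7C4, 0x91EE]:
    hist.add(rnti)
accepted = [r for r in (0x4A21, 0x0F03) if hist.is_active(r)]
```

In this toy input only `0x4A21` is accepted, since the randomly occurring values stay below the threshold.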

The length of the history and the threshold value are optimization parameters that allow a
trade-off between the probability of false positive detection, the minimum required
activity of individual UEs, and the detection delay [194].

Fig. 5.26: Example for the histogram-based RNTI validation approach. RNTIs of active UEs appear
with high frequency, while RNTIs of false DCI spread uniformly over the entire value range with low
frequency. ©[2016] IEEE. Reprinted, with permission, from [195].

Fig. 5.27: Recursive PDCCH analysis with the short-cut decoding approach for immediate yet reliable
discovery of unseen RNTIs. ©[2019] IEEE. Reprinted, with permission, from [196].


Fig. 5.28: Simulation results for decoding the PDCCH of a 1 s Long Term Evolution (LTE) signal with
FALCON and OWL for different Signal to Noise Ratio (SNR) values. Results are averaged over
10 repetitions; the standard deviation is shown as error bars.


To improve the detection speed of unseen RNTIs, which is especially important for
short-term observations, we proposed a novel short-cut decoding approach in [196].
The approach exploits the scheme by which the eNodeB populates the CCEs with the
encoded DCI sequence. Although in most cases such a sequence fits into a single CCE,
operators configure the eNodeB to use higher aggregation levels in order to increase
the robustness against distortions. At these levels, the sequence is repeated circularly, so a
properly cropped sequence still allows for a correct decoding of the DCI. Therefore, if
decoding the full and the shortened sequence results in the same DCI and CRC, the
associated RNTI can be assumed to be valid and the DCI is accepted. This approach
can be implemented efficiently by combining a breadth-first search with a depth-first
search for each location as shown in Figure 5.27. The top line shows the PDCCH as
a sequence of consecutive CCEs, which are either occupied or empty according to
the placement done by the eNodeB. Empty CCEs, recognizable by insufficient signal
power, can be skipped as they cannot contain any meaningful information. Conversely,
occupied CCEs may also not have meaningful information, as the received signal power 
may originate from a neighbor cell with overlapping PDCCH in time and frequency. The 
breadth-first search component starts with aggregation level L = 8 and sequentially 
decodes all locations at this level (second line). In the given example, the two possible 
locations (1) and (2) are skipped, as each overlaps at least one empty CCE, and hence 
does not form a continuous sequence. The search continues with L = 4, inspecting the 
first location (3) with continuous CCE occupation by decoding the sequence for all DCI 
formats. If this inspection does not result in any DCI with a known RNTI, the depth-first 
component is activated and the location is inspected recursively using the next-smaller 
aggregation level. In the given example both locations (4) and (5) contain DCI with 
known RNTIs; overlapping locations (e.g. for L = 1) are marked as checked. As the 
recursion terminates, the breadth-first search continues with locations (6) and (7) both 
being skipped. Next, location (8) contains a valid DCI at L = 4 but the RNTI has not yet 
been seen. However, the recursive inspection of the shortened sequence, given by the 
first half at location (9), returns the same DCI and RNTI. As this only happens for valid 
DCI, the RNTI is immediately added to the list of known RNTIs, the DCI is accepted, 
and overlapping locations are marked as checked. 
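
The core comparison of the short-cut approach can be sketched as follows. This is a minimal Python sketch, not the FALCON implementation: the `decode` callable stands in for the actual PDCCH decoder, and the `(payload, rnti)` tuple is a hypothetical minimal DCI representation introduced only for illustration.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical minimal DCI representation: (payload, rnti).
Dci = Tuple[int, int]


def shortcut_check(
    cces: Sequence[bytes],
    decode: Callable[[Sequence[bytes]], Optional[Dci]],
) -> Optional[Dci]:
    """Accept a DCI if the full and the first half of the CCE sequence agree."""
    if len(cces) < 2:
        return None  # a single CCE cannot be bisected
    full = decode(cces)
    half = decode(cces[: len(cces) // 2])
    if full is not None and full == half:
        return full  # same DCI and RNTI from both decodings
    return None
```

A candidate that passes this check can be accepted immediately and its RNTI added to the list of known RNTIs; a candidate that fails it is left to the histogram-based validation.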

To enable detection even in the case of poor signal quality, where the bisected
sequence can no longer be decoded correctly, histogram-based validation can be
employed afterwards. If a recursive descent does not discover a known RNTI, all potential
RNTIs along the descent path are added to the history, and RNTIs exceeding the threshold
are added to the active set as described above.

The robustness and reliability of the combined approach in FALCON are presented in
Figure 5.28. It shows the number of missed and false DCI messages as solid and dashed
lines, respectively, as a function of the SNR on the channel after analyzing a well-defined
LTE signal for a duration of 1 s. Starting with 0 dB for poor radio conditions, the SNR is
increased in steps of 0.5 dB up to 15 dB, representing an excellent signal. For each step, the
figure shows the average and the standard deviation over 10 repetitions. Furthermore,
the figure also contains the results of OWL, which relies on the re-encoding approach
for short-term observations.

Independently of the SNR, the number of spurious DCI messages stays at a negligible
level for FALCON, whereas OWL produces at least 30 and up to 100 false detections.
Across all covered SNR values, FALCON misses significantly fewer DCI messages than
OWL. In particular, for SNR values greater than 7 dB, the number of missed DCI messages
falls below 10 for FALCON, while OWL remains at a level between 50 and 100. Thus,
the comparison of both approaches reveals the robustness and reliability of FALCON.

Similar results are achieved in the field, as shown by the activity histogram of each RNTI
over 5 s in Figure 5.29. Blue circles represent the number of resource allocations
detected by FALCON for each RNTI, and red crosses show the results of OWL. It is
evident that the most active RNTIs concentrate in a small value interval, indicating that
the eNodeB assigns RNTIs consecutively to new UEs. The peak region moves over time
towards larger RNTI values, as shown in the results 5 min and 10 min later. OWL, on the
other hand, reports numerous spurious DCI messages with random RNTIs, which are
uniformly distributed over the entire value range with very low frequency.

Fig. 5.29: Activity histogram of individual RNTI detections from field measurements. The highlighted
peak moving over time indicates an RNTI assignment strategy of the base station. ©[2019] IEEE.
Reprinted, with permission, from [196].


5.3.4 Cooperative Data Rate Prediction Leveraging FALCON


Client-based data-rate prediction is a key enabler for anticipatory mobile networking. 
By considering network context measurements to estimate the end-to-end transmis- 
sion efficiency—represented by the predicted data rate—mobile clients can actively 
contribute to optimizing the intra-cell resource efficiency by scheduling data-intense 
transmission to resource-efficient connectivity hotspots [622]. However, the accuracy of 
client-based data-rate prediction methods is inherently limited since the UEs are only 
aware of the radio channel conditions but not of the network load. 

In addition to the passive context measurements of purely client-based data-rate 
prediction according to [622], FALCON allows the derivation of additional features 
(number of Physical Resource Blocks (PRBs) and UEs, Transport Block Size (TBS), 
MCS) that are correlated to the current network load of the cell. For a proof-of-concept 
evaluation, the following feature sets are derived for the two transmission directions 
after an initial feature importance analysis: 

- Uplink feature set: RSRP, RSRQ, velocity, payload size, number of PRBs, number
  of UEs, cell ID
- Downlink feature set: RSRQ, velocity, payload size, number of PRBs, number of
  UEs, TBS, MCS, cell ID


It can be seen that the uplink direction is more sensitive to the radio channel conditions,
while the downlink performance highly depends on the intra-cell traffic load.

Fig. 5.30: End-to-end uplink data-rate prediction: performance comparison of different prediction
approaches and machine learning models. ©[2020] IEEE. Reprinted, with permission, from [626].


The resulting prediction accuracy of different machine learning models (Artificial
Neural Network (ANN), M5 regression tree, Random Forest (RF), Support Vector
Machine (SVM)) is shown in Figure 5.30. Although the RF model achieves the highest
overall accuracy, there are only minor differences between the machine learning
models. However, significant differences can be observed for the three different data-rate
prediction methods. While the purely client-based approach is mostly unaware of the 
traffic load and the purely network-based approach yields a high prediction error due to 
the absence of radio channel information, the cooperative prediction method reduces 
the average Root Mean Squared Error (RMSE) by 25 % in the uplink direction. As further 
analyzed in [626], similar improvements are also achieved in the downlink direction. 
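
The cooperative approach amounts to merging client-side context with the network-side features revealed by FALCON before the model is queried. The following minimal Python sketch illustrates this step for the two feature sets listed above. The function name, the dictionary keys, and all numeric values are hypothetical and serve only as an illustration; they are not measurement results.

```python
# Feature names follow the uplink and downlink feature sets listed in the text.
UPLINK_FEATURES = ("rsrp", "rsrq", "velocity", "payload_size",
                   "num_prbs", "num_ues", "cell_id")
DOWNLINK_FEATURES = ("rsrq", "velocity", "payload_size",
                     "num_prbs", "num_ues", "tbs", "mcs", "cell_id")


def build_feature_vector(client_ctx: dict, network_ctx: dict, direction: str) -> list:
    """Merge client-side and network-side context into one ordered feature vector."""
    if direction == "uplink":
        names = UPLINK_FEATURES
    elif direction == "downlink":
        names = DOWNLINK_FEATURES
    else:
        raise ValueError(f"unknown direction: {direction!r}")
    merged = {**client_ctx, **network_ctx}
    return [merged[name] for name in names]


# Illustrative values only, not measurement results.
client_ctx = {"rsrp": -95.0, "rsrq": -11.0, "velocity": 13.9,
              "payload_size": 2.0, "cell_id": 42}
network_ctx = {"num_prbs": 38, "num_ues": 7, "tbs": 5160, "mcs": 17}
uplink_vector = build_feature_vector(client_ctx, network_ctx, "uplink")
```

The resulting vector can be passed to any of the regression models compared in Figure 5.30. A purely client-based predictor would correspond to omitting the `network_ctx` entries from the feature list.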

These initial results show that the context-awareness and the predictability of mo- 
bile communications can be significantly improved by combining client measurements 
with network-side information. Therefore, future networks such as 6G should actively 
provide network-load information to the clients in order to allow them to actively par- 
ticipate in network management functions. 


5.3.5 Conclusion 


FALCON is a novel open-source and SDR-based instrument for LTE control channel
analysis that allows the reliable monitoring of the resource allocations of LTE cells in
real-time. Through the application of short-cut decoding, a fast DCI integrity check
is achieved and the list of active RNTIs, which corresponds to an estimate of the
number of active users, is derived. Aided by a histogram approach, the accuracy of
FALCON is maintained even during periods of low signal quality. The revealed
network-side information is of particular value for intelligent networking methods that utilize
end-to-end predictions for their decision making, such as the resource-efficient
vehicle-to-cloud communication discussed in Section 5.2. As purely client-based data-rate
prediction approaches that rely on network context measurements are unaware of the
current network load, their achievable prediction accuracy is inherently limited. As
demonstrated in a first real-world proof-of-concept study [626], the incorporation of
FALCON offers the potential to improve client-based data-rate prediction methods by
up to 25 % in the uplink and 30 % in the downlink direction.


5.4 Machine Learning-Enabled 5G Network Slicing


Caner Bektas 
Fabian Kurtz 
Dennis Overbeck 
Christian Wietfeld 


Abstract: In this contribution, the different parts of the end-to-end network slicing 
concept are presented, including the Core Network (CN) and the Radio Access Network 
(RAN), while highlighting the differences and similarities of both domains. 


Further, prototypical implementations and empirical evaluations of 5G network
slicing are discussed, deepening the understanding of network slicing and identifying
possible advantages and challenges. The predictability of user traffic in the respective
network slices poses such a challenge, as resources in the RAN, in contrast to resources
in the CN, are subject to fluctuations based on channel quality. Critical infrastructures
typically require very low latencies in the single-digit millisecond range and are thus
classified as ultra-Reliable Low Latency Communication (uRLLC). To mitigate
latency-intensive scheduling requests and grant operations, resources in the RAN have to be
pre-allocated for uRLLC slices.


This operation, also known as Configured Grants (CGs), pre-allocates
resources for, say, high-priority slices, so that User Equipments (UEs) are able to send
data without asking for resources, which reduces the scheduling latency to zero.
The simplest method for calculating CGs is based on static allocations, which has one
major drawback: unused resources are wasted and thus cannot be used by the remaining
slices, effectively lowering spectral efficiency. Here, we present SAMUS (Slice-Aware
Machine Learning-based Ultra-Reliable Scheduling), a data-driven method to predict
future resource demands based on real data, e.g., solar activity in smart grid slices, to
reduce latencies while maintaining high spectral efficiency.


5.4.1 Introduction to 5G End-to-End Network Slicing 


Critical infrastructures, such as energy networks, logistics, or autonomous transportation,
are becoming more and more automated to further increase efficiency. Automation
is often achieved by the self-organization of processes and actors via mobile
communication systems. As many different vertical industries rely on mobile communication,
a highly diverse set of Key Performance Indicators (KPIs) needs to be considered by the
communication systems.

Until recently, dedicated communication networks were the go-to solution to meet these
divergent criteria, as they can be designed specifically for the needs of
the respective critical infrastructures. By contrast, the fifth generation of mobile
networks (5G) aims to unify this diverse and partly contradictory set of requirements
in a single physical infrastructure. Based on Software-Defined Networking (SDN)
and Network Function Virtualization (NFV) techniques, network slicing is integrated
into the 5G standard. By utilizing virtual dedicated networks called network slices on
top of a single physical communication network, various vertical industries can be
automated, as shown in Figure 5.31.

Fig. 5.31: Network slicing as a key enabler for fulfilling all specific service requirements
simultaneously.


5.4.2 5G Core Network Slicing 


5.4.2.1 Description and Methodology 
The virtualization of network resources is a main pillar of 5G networking, as
illustrated in Figure 5.32.

Fig. 5.32: Overview of a sliced 5G communication network.

The creation of multiple isolated network partitions known as slices makes it possible to
manage different use cases independently and efficiently, each with its respective demands
on QoS or other guarantees. For 5G communication, three main categories are defined.
The first category is enhanced Mobile Broadband (eMBB), which is used for data-rate
intensive services (up to 20 Gbit/s). This category comprises ultra-high resolution video
streaming as well as fixed wireless broadband and Augmented and Virtual Reality
(AR/VR). The second category is massive Machine Type Communication (mMTC),
designed for the emerging Internet of Things (IoT) and Industry 4.0 applications,
both of which introduce a significant increase in inter-machine communication. However,
a massive number of devices comes with its own set of challenges in, for example, electrical
power grid application scenarios such as smart metering. The third major category is
the ultra-Reliable and Low Latency Communication (uRLLC) service. This category
comprises services such as Intelligent Transportation Systems (ITS) with Floating Car Data
(FCD)-based Vehicle-to-X communication. Here, mission-critical and latency-sensitive
applications are addressed. Our approach builds on NFV and SDN, which are closely
related. With NFV, hardware and software are decoupled and functionalities are abstracted
in order to achieve highly flexible communication infrastructures for enabling cloud
computing. Virtual Network Functions (VNFs) are thus able to run on
Commercial-Off-The-Shelf (COTS) server platforms. By using the complementary approach of NFV and
SDN, the controller can dynamically route traffic flows between the VNFs, while being
deployed as a VNF itself. In addition to the utilization of SDN and NFV, our concept
is based on queuing strategies utilizing the Hierarchical Token Bucket (HTB). On the
bare-metal and virtualized data-plane devices, the switching software Open vSwitch
(OVS) is deployed. Furthermore, a Management and Network Orchestration (MANO)
controller is implemented. This controller creates a main bridge in each switch, which
includes the respective physical ports. In addition, one bridge per slice is added or removed
as needed. The concept is depicted in Figure 5.33.

Fig. 5.33: The developed network slicing architecture.


The slice bridges comprise virtual ports, which reside within the main bridge.
The orchestration is done via the MANO controller, which dynamically instantiates
slice controllers (e.g. via Docker), which in turn can be optimized for each application
scenario. When traffic enters the data plane, the MANO controller assigns
packets to the respective bridge. There, the flow is mapped to the respective QoS queue
and virtual destination port on the main bridge according to the specific protocol or other
criteria of the packet. This is done by the respective slice’s controller. For each hop,
the slice controller repeats this procedure of directing the flows to the main bridge.
Unknown flows or unspecified matches are handled on a best-effort basis. While this
first part focuses on wired 5G communications, compatibility with the air interface
slicing technologies presented in later sections is maintained.
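
The effect of the HTB-based queues on slice isolation can be illustrated with a simplified, rate-based model of the borrowing idea behind HTB. This Python sketch is not the Linux or OVS implementation: it does not model tokens or bursts, it assumes that the sum of assured rates fits into the link capacity, and the class and function names are hypothetical. The 40 and 60 Mbit/s values in the test scenario mirror the two-slice setup on a 100 Mbit/s link evaluated later, ignoring protocol overhead.

```python
from typing import Dict, List, NamedTuple


class SliceClass(NamedTuple):
    name: str
    rate: float      # assured rate in Mbit/s
    ceil: float      # maximum rate including borrowed capacity in Mbit/s
    priority: int    # lower value borrows spare capacity first


def allocate(capacity: float, classes: List[SliceClass],
             demand: Dict[str, float]) -> Dict[str, float]:
    """Distribute the link capacity for one interval.

    Phase 1 serves every slice up to its assured rate. Phase 2 lends the
    remaining capacity in priority order, never exceeding a slice's ceil.
    """
    grants = {c.name: min(demand.get(c.name, 0.0), c.rate) for c in classes}
    spare = capacity - sum(grants.values())
    for c in sorted(classes, key=lambda c: c.priority):
        wanted = min(demand.get(c.name, 0.0), c.ceil) - grants[c.name]
        extra = max(0.0, min(wanted, spare))
        grants[c.name] += extra
        spare -= extra
    return grants


# Two capped slices on a 100 Mbit/s link; slice B offers more than it may use.
capped = [SliceClass("A", 40.0, 40.0, 0), SliceClass("B", 60.0, 60.0, 0)]
isolated = allocate(100.0, capped, {"A": 40.0, "B": 80.0})
```

In this model the overloaded slice B is limited to its own share while slice A keeps its full assured rate, which is the isolation property examined in the empirical evaluation. Setting `rate` to zero and `ceil` to the link capacity yields a best-effort class that only uses capacity no other slice currently needs.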


5.4.2.2 Empirical Evaluation of 5G Core Network Slicing 

Overview of the Testing Environment The testbed scenario is depicted in Figure 5.34.
Six servers are assigned in pairs to each of the different use cases. These servers
function as hosts to either send or receive data traffic over the sliced network. Furthermore,
four machines are designated as SDN controllers, where three of them act as slice
controllers running Floodlight and one is the MANO controller employing Ryu. To avoid
measurement interference, three different networks, namely out-of-band
control, maintenance, and sliced data-plane network, are in use. The Precision Time
Protocol (PTP) is utilized to synchronize the controller clocks with a maximum
deviation of 153 µs and a mean deviation of 16 µs. The underlying network load is generated
via iPerf2 community edition and consists of User Datagram Protocol (UDP) packets.
The maximum performance of the evaluated methods needs to be determined by
considering the different layers of the ISO-OSI stack. The Ethernet frame size on layer 2
of the OSI model is 1512 B, which is used as a point of reference. Since performance
evaluations are located on layer 4, the payload (i.e. goodput) amounts to 1470 B, which is
97.2 % of the layer 2 data rate. The following measurements were repeated at least 100
times with a minimum duration of 1 min per run.

Fig. 5.34: Evaluation scenario in the testing setup.
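
The layer ratios used throughout the evaluation follow directly from the frame sizes. The short Python sketch below reproduces this arithmetic; the 1532 B layer 1 frame size is the value used in the misconfiguration analysis later in this section, and the constant and function names are illustrative.

```python
UDP_PAYLOAD_BYTES = 1470  # layer 4 payload (goodput) per packet
L2_FRAME_BYTES = 1512     # layer 2 Ethernet frame size used as reference
L1_FRAME_BYTES = 1532     # layer 1 frame size used in the misconfiguration analysis

goodput_share = UDP_PAYLOAD_BYTES / L2_FRAME_BYTES   # about 0.972
l2_share_of_l1 = L2_FRAME_BYTES / L1_FRAME_BYTES     # about 0.987


def layer4_rate(layer2_rate_mbit: float) -> float:
    """Convert a configured layer 2 data rate into the usable layer 4 goodput."""
    return layer2_rate_mbit * goodput_share
```

For example, `layer4_rate(40)` gives roughly 38.9 Mbit/s, which matches the effective layer 4 rate reported for a slice configured with 40 Mbit/s on layer 2.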


Evaluation Scenarios Scenario A is a performance study in which key aspects
of 5G and critical infrastructure communication, such as delay and data
rate, are evaluated for varying network loads. In this way, the overhead of our approach is
determined to demonstrate the efficient use of resources. By using 100 Mbit/s Ethernet links,
possible limitations can be avoided while simultaneously affording the option of CPU load
monitoring during testing. Moreover, the independence of the slices from each other
is verified, so that detrimental effects of errors or overload in one slice on other
slices can be ruled out. Within the evaluation, the network load is increased in steps,
reaching beyond the maximum usable data rate/goodput, i.e. 97.2 % of the nominal
layer 2 link capacity.

Tab. 5.1: Slices and traffic flows of the critical infrastructure communications scenario.

Slice (descending priority)         Use case            5G service class  Priority within slice  Hard min. data rate [Mbit/s]  Max. delay [ms]
Smart grids (IEC 61850)             Protection          uRLLC             Highest                50                            1
                                    Smart metering      mMTC              High                   200                           20
Intelligent Transportation Systems  Floating car data   uRLLC             Highest                100                           1
                                    Passenger Internet  eMBB              Low                    450                           10
Best-effort                         Multimedia          None              Lowest                 None                          100

This approach represents cases in which end users try to use more resources than
allocated for their respective slice, thus serving to demonstrate slice independence. The
misconfiguration by operators of sliced communication networks is simulated as well.
For this, two slices are configured whose combined data rate exceeds the data rate of the
underlying physical network. Scenario B represents a scalability analysis, which demonstrates
the viability of the approach for deployment in large-scale, multi-tenant communication
infrastructures. Since the number of slices should not influence the overall network
performance, the delay performance without slicing and with 2, 8, and 16 slices is analyzed.
The available data rate is shared equally among the slices, with traffic streams utilizing
100 % of the respective slice’s capacity. This ensures that any observed side effects are
caused by slicing and not by network congestion or other factors. Furthermore, the validity of
slice isolation and the stability of the end-to-end delay are examined. For this, seven out of
21 slices are subjected to UDP-based traffic with data rates above the allocated limit.
In contrast to scenario A, 1 Gbit/s Ethernet is used to stress test the concept. Finally,
scenario C depicts a critical infrastructure communication scenario including FCD of ITS and
the IEC 61850 Smart Grid (SG) protocol.

Since both use cases are considered uRLLC 5G services and are therefore assigned
the highest priority, the slices on which they are transmitted are allocated equal priority,
including the dedicated SDN controllers. Smart metering (representing mMTC) and
passenger Internet (eMBB) are included in the related slices to demonstrate the ability
of our solution to distinguish traffic. Moreover, a best-effort slice is included for handling
multimedia traffic and perpetually transmitting low-priority data at 950 Mbit/s, roughly
consuming the maximum available layer 4 goodput of the 1 Gbit/s network. Therefore,
if another slice needs a specific data rate, the network is overloaded and resources are
reallocated according to the differing priorities. The maximum tolerable delays for each
traffic flow are given in Table 5.1. For the evaluation, the given data rates were achieved by
bundling multiple traffic flows. Such bundling of constant data rates can also be found in
real-world use cases such as smart metering.

Fig. 5.35: End-to-end delays of two slices for varying traffic loads.


Evaluation Results Figure 5.35 depicts the end-to-end delays of two slices for varying traffic
loads [360]. The physical 100 Mbit/s Ethernet network is shared fairly;
slices A and B transmit UDP packets. Below the aforementioned limit of 97.2 %, the
median end-to-end delay is 1.05 ms with a variance of approximately 0.05 ms.
When the limit is exceeded step by step at slice B, however, an overload situation is
created, resulting in increased delays. At 101 % load, the median delay rises sharply
to 1.212 ms. Nevertheless, the delays at slice A remain unaffected, even compared to no
slicing, as depicted in the enlarged violin plots. Hence, the isolation of the slices is
shown. The misconfiguration by the operator is simulated and depicted in Figure 5.36.

Slice A receives a data rate of 40 Mbit/s (38.9 Mbit/s effectively on layer 4). The total
sum of queue data rates (depicted on the x-axis) should not exceed the theoretical layer
2 limit of the 100 Mbit/s Ethernet link. This maximum is calculated as the ratio between
frame sizes at layers 2 and 1, which amounts to 1512 B / 1532 B = 98.7 %. In the event
of misconfiguration, slice B tries to utilize resources that do not exist. Therefore,
slice B cannot maintain the layer 4 goodput. While slice A consumes the HTB tokens
and remains stable, the data rate of slice B levels out at 56.9 Mbit/s, which is below
the configured 60 Mbit/s queue data rate on layer 2. No overhead in terms of achieved
throughput is observed, which confirms expectations. Figure 5.37 depicts
the evaluation results of scenario B.

Fig. 5.36: Impact analysis of misconfiguration by the physical network operator.

Fig. 5.37: Evaluation results of multiple slices on end-to-end delays.

The end-to-end delays of dedicated, sliced networks are given for an overall capacity
of 1 Gbit/s. The data rate is distributed fairly between the slices, with all of them in idle
mode except one. With as many as sixteen slices, the delays remain stable. However,
outliers of up to 0.36 ms may be a result of CPU context switches, which are required
since the hardware provides a maximum of eight threads and queues. The outliers
down to 0.17 ms are presumably caused by the non-real-time reduced timer/interrupt
coalescing of the Network Interface Card (NIC) and Operating System (OS), which
is triggered by the rise in computational load. Therefore, for highly sliced networks,
performance optimizations of the developed source code, real-time kernels, and higher
thread-count CPUs are to be pursued. The subsequent stress test is shown in Figure 5.38.

Fig. 5.38: Stress testing of scalability with partial overload.


Under normal operation (left-hand side), 21 coexisting slices can fully utilize their 
allocated data rates with a stable median delay of 1 ms. In comparison with previous 
tests, the delays are higher, because of more slices sharing a slower physical network 
of 100 Mbit/s. On the right-hand side, partial overload is simulated, with seven
slices trying to exceed their limits and accordingly causing increasing delays of up to 3.5 s.
However, the other slices stay unaffected. Therefore, the isolation of network slices 
remains equally robust even with high loads in several slices. Furthermore, the data 
rate remains stable across all realized scalability tests. 

Finally, Figure 5.39 summarizes the measured data rate of the traffic flows given
in Table 5.1. It starts with only one traffic flow of 950 Mbit/s, which fully utilizes the
physical network on a best-effort basis. Even though this traffic continues
throughout the test, network resources can be allocated to higher priority slices. Thus,
when passenger Internet and FCD traffic of the ITS slice are generated, the best-effort
throughput is reduced nearly instantaneously. The same happens when introducing
protection and smart metering traffic on the SG slice. For an especially critical test case,
the protection and FCD traffic is simultaneously increased to 50 Mbit/s at 90 s into the
measurement. Figure 5.40 depicts the end-to-end delay of the slices.

Fig. 5.39: Critical infrastructure communication scenario: data rate allocation and slice isolation.

As shown, hard service guarantees are provided during these transitions.
Best-effort traffic typically stays below the set boundary of 100 ms. Nevertheless, outliers of
about 350 ms occur, which are induced by slice overloads. The delay for smart metering and
passenger Internet stays below 3 ms with a median of approximately 1.3 ms and therefore
satisfies the service-level guarantees. The outliers result from the starting phase. During slice
reconfiguration, the violins of protection and FCD traffic show slight delay variance,
which does not affect the requirements since it stays mostly below 0.5 ms.


5.4.3 Data-Driven 5G Network Slicing in the Radio Access Network 


In contrast to the previous sections, in which network slicing in the core network was 
discussed, the focus here is on the Radio Access Network (RAN). In this context, the 
data-driven aspect of network slicing becomes more important, as low-latency slices 
can be realized only via the prediction of emerging network traffic. This relation is 
described in the following subsections. 


5.4.3.1 Introduction to Data-Driven Network Slicing 
Previously, the three main service types eMBB, uRLLC, and mMTC were introduced. The
balance between uRLLC slices and eMBB slices is particularly challenging to maintain
within Radio Resource Management (RRM), i.e. the network scheduler that is a crucial
part of realizing network slicing within the RAN. To understand this relation, the end-to-end
latency components, which were derived from [493], are depicted in Figure 5.41.

Fig. 5.40: Critical infrastructure communication scenario: end-to-end slice delays.

Fig. 5.41: Components in cellular networks which induce latency [493]. ©[2021] IEEE. Reprinted, with
permission, from [54].


- T_Transport: Latency caused by the transmission of data through the transport
  network, for example when a web page is retrieved from the Internet.

- T_Core: Once data has been transmitted via the radio interface or from the transport
  network, it passes through the core network. This introduces additional latency,
  firstly because the data is transmitted over an additional network, but also because the IP
  packets are unpacked and repacked into different protocols required by the mobile
  network.

- T_Front-/Backhaul: The connection between gNodeB (5G base station) and core
  network introduces additional latency.

- T_Radio: The physical properties of the transmission channel are the main cause of
  radio latency, but the scheduler (T_Sched) also adds a significant delay.

The latency induced by the network scheduler (from here on T_Sched) is part of the
T_Radio component, which is heavily reliant on the type of services or slices present
in the 5G network. To illustrate this, suppose there are two slices configured
within the communication network, one uRLLC slice and one eMBB slice. From now
on, the focus is on the uplink direction of data transmission (from UE to gNodeB). The UEs
or the applications within the eMBB slice are data-rate intensive, which means that as
many Resource Blocks (RBs) as possible are to be scheduled. By contrast, applications
in uRLLC slices are not data-rate intensive, but are to be scheduled as fast as possible
to minimize the overall end-to-end latency (cf. Figure 5.41). This means that in order to
minimize the scheduling latency, it is crucial to issue the scheduling grants before a
request is even generated. For this, the so-called Configured Grant (CG) or proactive
scheduling is introduced in 5G [375]. As the name suggests, the scheduling grants
can be configured in advance to ensure Quality of Service (QoS) requirements. The
major challenge, however, is that this requires a prediction of future data demands and
channel qualities to allocate the required number of RBs for each network slice. The
exact prediction of RBs for the uRLLC slice is crucial in this process because end-to-end
latency will increase significantly if the number of predicted RBs is too low, which will induce
retransmissions. If the number of predicted RBs is too high, the unused RBs will be wasted and
thus not available for other network slices within the cell. This in turn affects the
aforementioned balance between uRLLC and eMBB slices, as the required prediction
will induce prediction errors and thus waste resources for the data rate-intensive eMBB
slices.

The remainder of this section will describe a data-driven CG-based scheduling 
and simulation framework called SAMUS [54], or Slice-Aware Machine learning-based 
Ultra-reliable Scheduling. 


5.4.3.2 Description and Methodology of SAMUS: Slice-Aware Machine 
Learning-based Ultra-Reliable Scheduling - A Data-Driven Network Slicing 
Framework 

5G-RGS (5G Resource Grid Simulation) Framework

Figure 5.42 provides an overview of all modules, inputs, and outputs of the SAMUS system.
As can be seen, the SAMUS system comprises not only the actual SAMUS scheduler prototype
but also the 5G Resource Grid Simulation (5G-RGS) framework, which
was specifically developed to evaluate the SAMUS scheduler prototype. The channel
conditions and data amounts of each User Equipment (UE) (or the external data used to
predict the amounts) are provided as input to both modules. The 5G-RGS framework is
then able to calculate the resulting data rates and packet latencies (corresponding to
$T_{\mathrm{Sched}}$) based on the aforementioned channel conditions, the Transmission Time
Interval (TTI), and the allocated RBs. The latter are a product of the data-driven SAMUS
scheduler prototype, which generates CGs in the form of resource grid allocations based
on external data, a process which will be described later in this section.

Fig. 5.42: SAMUS system overview including all modules, inputs, and outputs. ©[2021] IEEE.
Reprinted, with permission, from [54].


In order to further detail the data rate calculation process of the 5G-RGS framework,
Equation 5.15 is provided,

$$\text{Data Rate (Mbit/s)} = 10^{-6} \cdot \sum_{n=1}^{N_{\mathrm{UE}}} \left( \mathrm{TBS}^{(n)}_{\mathrm{TTI}} \cdot N_{\mathrm{TTI}} \right) \qquad (5.15)$$

where $N_{\mathrm{UE}}$ describes the number of UEs in the slice, $\mathrm{TBS}^{(n)}_{\mathrm{TTI}}$ the Transport Block Size (TBS)
available for the n-th UE in bit, and $N_{\mathrm{TTI}}$ describes how many TTIs are available in a
second (the default TTI here is the New Radio (NR) specification of 1 ms).
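
As a minimal sketch of Equation 5.15, the following Python function sums the per-TTI transport block sizes of all UEs in a slice and scales them to Mbit/s. The function name, the argument names, and the example TBS values are illustrative assumptions and are not taken from the 5G-RGS implementation.

```python
def slice_data_rate_mbit(tbs_bits_per_tti, tti_ms=1.0):
    """Slice data rate in Mbit/s following Equation 5.15.

    tbs_bits_per_tti: transport block size in bit per TTI, one entry per UE
                      (hypothetical input format).
    tti_ms:           TTI duration in ms; 1 ms is the default named in the text.
    """
    n_tti = 1000.0 / tti_ms  # N_TTI: number of TTIs available per second
    return 1e-6 * sum(tbs * n_tti for tbs in tbs_bits_per_tti)


# Three UEs with made-up transport block sizes (bit per TTI).
example_rate = slice_data_rate_mbit([1000, 2000, 3000])
print(f"{example_rate:.2f} Mbit/s")
```

With a 1 ms TTI, $N_{\mathrm{TTI}}$ evaluates to 1000, so each bit granted per TTI contributes 1 kbit/s to the slice data rate.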

Moreover, packet latencies are calculated via Equation 5.16:

$$\text{Latency (ms)} = (I_S - I_C) \cdot \mathrm{TTI} \qquad (5.16)$$

where $I_S$ represents the scheduling interval of a final packet bit transmission and $I_C$
the interval of packet creation. Note that latency components like retransmissions or
other components of $T_{\mathrm{Radio}}$ are neglected. Based on scenarios in [53], the 5G-RGS
framework was successfully validated.
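
Equation 5.16 can be sketched in a few lines of Python. The packet list below is invented for illustration: packets served by a correctly predicted CG are sent in the interval in which they are created, while packets that fall back to a request/grant cycle are sent several intervals later.

```python
from statistics import mean, pstdev


def packet_latency_ms(creation_interval, final_tx_interval, tti_ms=1.0):
    """Scheduling latency following Equation 5.16: (I_S - I_C) * TTI."""
    if final_tx_interval < creation_interval:
        raise ValueError("a packet cannot be sent before it is created")
    return (final_tx_interval - creation_interval) * tti_ms


# Hypothetical (I_C, I_S) pairs: two packets hit a CG, two need a request first.
packets = [(10, 10), (11, 15), (20, 20), (21, 24)]
latencies = [packet_latency_ms(i_c, i_s) for i_c, i_s in packets]
mean_latency = mean(latencies)
std_latency = pstdev(latencies)
```

Mean and standard deviation are computed here because these are the latency statistics reported per slice in the evaluation.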


SAMUS Scheduler Prototype

As can be seen in Figure 5.43, the inputs for the SAMUS scheduler prototype comprise
channel conditions in the form of Channel Quality Indicators (CQIs) and data amounts
of each UE in the form of Buffer Status Reports (BSRs), which are generated
from historical data (hence, data-driven). Besides generating data-driven (low-latency)
CGs, the SAMUS scheduler can also process traditional (latency-intensive) Scheduling
Requests (SRs). The CGs, if predicted correctly based on historical data, can reduce the
scheduling latency $T_{\mathrm{Sched}}$ down to zero. To allocate the RBs and generate the
data-driven CGs, the ARIMA (Auto-Regressive Integrated Moving Average) method is
utilized to predict the future traffic data demands and CQIs.

Fig. 5.43: Overall SAMUS framework including all modules and their interactions. ©[2021] IEEE.
Reprinted, with permission, from [54].

To ensure safe operation of mission-critical slices, resources of critical applications
are allocated first, and the remaining RBs are granted to best-effort (eMBB, no QoS)
slices (based on the Greedy Network Slicing Scheduler in [53]). Traditional scheduling is
used whenever a packet could not be transmitted due to prediction errors (effectively
increasing scheduling latency). As a result, a resource grid in the form of a matrix is
passed on to the 5G-RGS, which calculates and logs the Key Performance Indicators
(KPIs), e.g., data rate and latencies. Finally, CQI and BSR values are updated and a new
cycle is initiated.
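
The priority rule described above (critical slices first, remaining RBs to best effort) can be sketched as follows. This is a simplified illustration and not the Greedy Network Slicing Scheduler of [53]; the total of 100 RBs, the priority numbers, and the per-slice demands are assumptions.

```python
def greedy_slice_allocation(total_rbs, demands):
    """Grant RBs slice by slice in priority order.

    demands: list of (slice_name, priority, requested_rbs); a lower priority
             number means a more critical slice (hypothetical convention).
    Returns a dict mapping slice_name to the number of granted RBs.
    """
    remaining = total_rbs
    grants = {}
    for name, _priority, requested in sorted(demands, key=lambda d: d[1]):
        granted = min(requested, remaining)  # never grant more than is left
        grants[name] = granted
        remaining -= granted
    return grants


# One scheduling interval with made-up demands.
grants = greedy_slice_allocation(
    total_rbs=100,
    demands=[("BE", 3, 100), ("SG", 1, 10), ("EV", 2, 15)],
)
```

In this example the best-effort slice asks for the whole cell but only receives what the two critical slices leave over.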

Fig. 5.44: Flow chart of SAMUS’s prediction module utilizing ARIMA for training and operation.
©[2021] IEEE. Reprinted, with permission, from [54].

The details of the ARIMA-based prediction module are depicted in Figure 5.44 as a flow
chart. There, the dataset associated with a slice is split up into training and validation
datasets with a ratio of 2/3 and 1/3, respectively. Subsequently, the ARIMA model is trained
in the course of offline learning based on the training dataset in order to predict future 
data, which in turn is utilized by the SAMUS scheduler to generate CGs. The data rate 
that corresponds with these CGs is the so-called data rate predicted. By contrast, the 
actual data rate required is calculated based on the validation dataset. The value for 
the actual data rate required results from the actually transmitted packets within the 
simulation, which are generated in order to test the prediction quality of the ARIMA 
module. Also, online learning is facilitated to further optimize predictions based on
the newly acquired data during the simulations.
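
The train/validate flow can be illustrated with a deliberately small stand-in for the ARIMA model. The sketch below fits an ARIMA(1,1,0) model without a constant term by least squares on the first differences and then produces one-step forecasts over the validation part. The model order, the split fraction, and the synthetic ramp-shaped demand series are assumptions chosen for illustration; they are not the configuration used in SAMUS.

```python
def fit_ar1_on_differences(series):
    """Least-squares AR(1) coefficient for the first differences of `series`."""
    diffs = [b - a for a, b in zip(series, series[1:])]
    num = sum(x * y for x, y in zip(diffs, diffs[1:]))
    den = sum(x * x for x in diffs[:-1])
    return num / den if den else 0.0


def forecast_next(series, phi):
    """One-step ARIMA(1,1,0) forecast: last value plus phi times the last change."""
    return series[-1] + phi * (series[-1] - series[-2])


def rolling_forecast(training, validation, phi):
    """Forecast each validation sample from all samples observed before it."""
    seen = list(training)
    predictions = []
    for actual in validation:
        predictions.append(forecast_next(seen, phi))
        seen.append(actual)  # the true value becomes available afterwards
    return predictions


# Synthetic demand that grows linearly (e.g. traffic ramping up over time).
demand = [2.0 * t for t in range(12)]
split = (2 * len(demand)) // 3  # assumed split point between training and validation
training, validation = demand[:split], demand[split:]

phi = fit_ar1_on_differences(training)  # offline training
predictions = rolling_forecast(training, validation, phi)
errors = [p - a for p, a in zip(predictions, validation)]
```

Refitting `phi` inside the loop of `rolling_forecast` after each new sample would be one simple way to mimic the online learning step mentioned above.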

The simulations used to evaluate the SAMUS scheduler prototype
are described in the next section.


5.4.3.3 Evaluation of the Data-Driven Scheduler Prototype SAMUS 

Evaluation Scenario and Parameters

As indicated in the previous sections, the SAMUS framework was evaluated based on a
realistic network slicing scenario. To compare the novel approach of the SAMUS scheduler
with traditional methods and to present different trade-off strategies, so-called modes
were designed and utilized.
The following modes were configured and evaluated:
- Mode 1: Traditional scheduling with request and grant method (without CGs)
- Mode 2: Fixed amount of RBs (fixed CGs)
  - Mode 2.1: Average historical data rate used as amount of fixed grants (fixed
    optimistic approach)
  - Mode 2.2: Maximum historical data rate used as amount of fixed grants (fixed
    pessimistic approach)
- Mode 3: Data-driven CGs (predicted based on ARIMA)
  - Mode 3.1: Predicted CGs as is (no over-provisioning)
  - Mode 3.2: Over-provisioned predicted CGs (with 10 % over-provisioning)
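
To make the difference between the modes concrete, the following sketch derives the size of a configured grant for one critical slice under each mode and compares it with the demand that actually occurs. Expressing grants in RBs per interval, rounding up to whole RBs, and all numeric values are assumptions made for this example only.

```python
import math


def configured_grant_rbs(mode, history, predicted=None, over_provisioning_percent=10):
    """Size of the proactive grant for one interval under the given mode."""
    if mode == "1":    # request/grant only, nothing is granted in advance
        return 0
    if mode == "2.1":  # fixed optimistic: historical average
        return math.ceil(sum(history) / len(history))
    if mode == "2.2":  # fixed pessimistic: historical maximum
        return max(history)
    if mode == "3.1":  # prediction as is
        return math.ceil(predicted)
    if mode == "3.2":  # prediction plus a safety margin
        return math.ceil(predicted * (100 + over_provisioning_percent) / 100)
    raise ValueError(f"unknown mode: {mode}")


def grant_outcome(granted, needed):
    """Unused RBs are lost to other slices; a shortfall must be requested later."""
    return {"unused": max(granted - needed, 0), "shortfall": max(needed - granted, 0)}


history = [4, 6, 10, 8]    # made-up past demands in RBs
predicted, needed = 9, 9   # made-up forecast and true demand for the next interval
grants = {m: configured_grant_rbs(m, history, predicted)
          for m in ("1", "2.1", "2.2", "3.1", "3.2")}
outcomes = {m: grant_outcome(g, needed) for m, g in grants.items()}
```

A shortfall corresponds to packets that have to wait for a traditional scheduling request, whereas unused RBs reduce what is left for the best-effort slice.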


The mode configuration as well as other simulation parameters like configured and
simulated network slices are listed in Figure 5.45.

The three realized slices in the evaluation scenario are also depicted in Figure 5.46
and are defined as follows:

- Smart Grid (SG) slice (uRLLC - Highest priority): The Smart Grid slice is modeled after
photovoltaic systems transmitting data to regulate energy generation. The National
Renewable Energy Laboratory (NREL)² provides open data for solar activity, which
is used to train the ARIMA model and generate data traffic proportional to the solar
activity.

- Electric Vehicle (EV) charging slice (uRLLC - High priority): EV charging-point
occupancy data of the German city of Bonn³ was gathered, and data traffic based on this
dataset is generated for the EV charging slice.

- Best-Effort (BE) slice (eMBB - Low priority): A constant rate of 18.96 Mbps is
generated, which corresponds to the remaining capacity of the cell, to simulate devices
with high data rate demands and to measure the remaining data rate within the
non-critical eMBB slices after mission-critical slices are served by the SAMUS
scheduler.

² See https://www.nrel.gov/grid/solar-power-data.html.
³ See https://new-poi.chargecloud.de/bonn.

General Settings
  Channel Bandwidth        20 MHz
  5G Subcarrier Spacing    15 kHz
  Channel Quality          Fixed Modulation and Coding Scheme (MCS) of 15
  5G MCS Index Table       64QAM
  Packet TTI               1 ms

Fig. 5.45: Settings and parameters of the simulation framework and the different modes utilized in
the evaluation of the SAMUS framework. ©[2021] IEEE. Reprinted, with permission, from [54].

Fig. 5.46: Evaluation scenario comprising mission-critical and best-effort network slices to analyze
different trade-off strategies within the SAMUS system. ©[2021] IEEE. Reprinted, with permission,
from [54].


In the following section, the evaluation of the SAMUS framework is presented based 
on this scenario. 


Evaluation Results

The 5G-RGS framework described before was utilized to evaluate
the SAMUS scheduler prototype based on the modes presented earlier, which represent
different strategies for trading off uRLLC latency against the eMBB data
rate. For this, a 60 min interval was analyzed (cf. Figure 5.45), which represents a time
frame of highly dynamic activity within the different slices, such as the time of sunrise
in the SG slice or the time of rush hour in the EV slice.

Fig. 5.47: Data-rate progressions for the different network slices in mode 1 (5G parameters only,
traditional scheduling). ©[2021] IEEE. Reprinted, with permission, from [54].


In Figure 5.47, the results for mode 1 are depicted, where the average slice data rate in
Mbit/s is plotted as a function of the simulation time in min. The dotted and solid red
lines represent the maximum and the actual channel bandwidth utilization, respectively.
This indicates the efficiency of resource usage, i.e., high channel utilization means
low RB wastage. The green, black, and red solid lines represent the average uplink
data rate transmitted for the SG, EV charging, and Best-Effort (BE) slices, respectively.
With the Greedy Network Slicing method, the available RBs
are allocated to the higher priority slices at the expense of the BE data rate. This is the
desired behavior of the utilized traditional scheduling requests and grants, because
RBs are distributed exactly as required and no resources are wasted. However, the main
disadvantage of this approach is that it leads to very high scheduling latency ($T_{\mathrm{Sched}}$).
This connection becomes clearer when looking at Figure 5.48.

Fig. 5.48: Average BE data rates versus mean and standard deviation of high priority slice latencies
(averaging window of 2 s and strictest 3GPP latency requirement according to 3GPP 23.501 [3]) for
all modes. Margins for remaining latency components are indicated by the arrows in the respective
colors of the slices. ©[2021] IEEE. Reprinted, with permission, from [54].


In this figure, two different y-axes are depicted: the left axis (gray bar plot) shows the
ratio of the average transmitted data rate to the actual data rate required by the BE slice,
and the right axis shows the mean scheduling latency ($T_{\mathrm{Sched}}$) of the high priority slices
(green bar: SG slice; blue bar: EV charging slice; black lines: standard
deviation). The different modes are listed on the x-axis. Looking at the
results for mode 1, it can be seen that the bandwidth utilization, represented by
the ratio of transmitted data to required data, is very high at 92.74 %. At the same time,
however, only 1.53 ms and 2.47 ms margins for other latency components based on the
strictest 3GPP end-to-end latency requirements [3] are left for the EV charging and SG
slices, respectively. This results from the utilization of lengthy traditional scheduling
mechanisms.

For comparison, the results of the modes in Figure 5.49 can be consulted. As for the 
modes 2.1 and 2.2, depicted in Figure 5.49a and Figure 5.49b, the channel bandwidth 
utilization drops for both approaches, especially for the pessimistic approach. This is 
the result of the fixed allocation of RBs to the mission-critical slices. However, the effect 
of this method on the latency becomes clear again with a look at Figure 5.48. There it 
can be seen that the margins for the end-to-end latency, especially for the pessimistic 
approach, increase to almost 5 ms, since the scheduling latency drops to almost zero 
due to the constant availability of resources. By contrast, the data rate efficiency of the 
BE slice drops down to 52.22 %. Thus, the fixed CGs represent a very latency-focused 
approach, whereas mode 1 maximizes channel utilization. 

The data-driven ARIMA-based mode 3, which is the major contribution of the 
SAMUS framework, represents a good balance between these two extremes, as can be 
seen by looking at the data rates in Figure 5.49c, 5.49d and the latencies in Figure 5.48 
for modes 3.1 and 3.2, respectively. Moreover, the channel bandwidth utilization is 
relatively high with an almost 80 % ratio of actual to requested data rate within the BE 
slice. Additionally, as data amounts of the mission-critical slices can be predicted very 
well, and thus, data can be instantly transmitted, margins for other latency components 
of 4.83 ms to 4.95 ms can be observed. The scheduling latency is zero most of the time. 


5.4.4 Conclusion 


In this section, we presented 5G network slicing approaches for both the core network
and the radio access network. While the same goal is pursued in both domains, the
implementations differ considerably. Especially in the RAN, machine learning-supported
methods will be indispensable, since the prediction of upcoming data traffic
is a prerequisite for implementing low-latency slices while still maintaining high
spectral efficiency. This relation was shown in this section on the basis of our
SAMUS approach, which is able to efficiently trade off resources between different slice
types.

(b) Mode 2.2: Pessimistic Fixed CGs
(c) Mode 3.1: Over-Provisioned Predicted CGs
(d) Mode 3.2: 10 % Over-Provisioned Predicted CGs


Fig. 5.49: Comparison of Data Rates Within the Defined Slices and For All Modes Based on the 
Defined Evaluation Scenario. ©[2021] IEEE. Reprinted, with permission, from [54]. 

Abstract: For mobile communication networks, radio spectrum resources have always
been a scarce commodity. With the cultivation of millimeter Wave (mmWave) wavelengths,
a vast amount of spectrum at frequencies above 24.25 GHz has become available
to serve the demands of enhanced mobile broadband services and applications of the
fifth generation of mobile communications (5G). However, the higher carrier frequencies
compared with the heretofore allotted spectrum come with novel challenges for the
operation of a cellular network: The more significant propagation losses require
directional/beam antennas and their directivity needs to be adjusted continuously and
individually per user. In addition, the poor obstacle penetration necessitates a careful 
beam alignment based on Line-Of-Sight (LOS) conditions. In case of obstructions, sig- 
nal reflection paths need to be leveraged, which may be volatile and time-consuming 
to discover. By means of signal quality measurements, a self-contained beam tracking 
may maintain the LOS or virtual LOS via reflections to mobile devices. As a further 
feature, the directional knowledge of the base station antenna beams can even be 
exploited for a bearing-like localization approach allowing for an enhanced network 
positioning service compared with cell-level approaches. The sophisticated Software- 
Defined Radio (SDR)-based mmWave platform allows for the experimental evaluation 
of the mentioned features. The results prove the potential of mmWave communications 
for various vehicular and logistics use cases. The lessons learned will go into future 
research directions such as smart radio environments. The novel technology of Recon- 
figurable Intelligent Surface (RIS) is a promising strategy for improving the capabilities 
of the general environment to supply better radio conditions to a wireless channel in 
non-LOS conditions. For example, a RIS can purposefully redirect the base station’s 
mmWave pencil beam to reach a device in an obstructed area and thus extend the 
network coverage. Future integrated, radar-like sensing capabilities of communication 
networks are expected to operate at mmWave frequencies due to large bandwidth, high 
directionality, and low multipath features promising high-quality measurements. We 
show that the channel information of current mmWave systems, beam orientation in 
particular, already enables novel sensing applications. 

5.5.1 Introduction


Besides the growing number of Internet-of-Things (IoT) devices with low data rate but 
high coverage requirements and applications demanding reliable or low-latency data 
transfer, another main direction of impact of fifth-generation mobile networks (5G) 
focuses on the enhanced Mobile Broadband (eMBB) services. It is believed that fu- 
ture applications such as augmented, virtual, or extended reality necessitate a high- 
performance wireless network infrastructure. 

Although the utilization of the traditional sub-6 GHz radio spectrum has been optimized in
terms of spectral efficiency, this resource is already heavily used. However, with 5G the
3rd Generation Partnership Project (3GPP) targets additional spectral resources in the
mmWave domain (particularly from 24.25 GHz to 52.6 GHz in Frequency Range 2 (FR2))
[2, Table 5.1-1]. Frequencies in the THz domain will also be targeted in future mobile
networks, promising even larger bandwidths and thus a vast amount of resources.
Although these resources ought to enable an enhanced throughput at the air interface,
novel challenges arise due to the higher frequencies.

Contrary to a popular misconception, the more severe path loss itself is not the main
issue, because higher frequencies allow for an increased antenna gain within the same
space constraints. With this, the path loss itself is more than compensated. Nevertheless,
the increased antenna gain is achieved by a more distinct directivity, which
demands a proper antenna alignment. Phased Array Antennas (PAAs) resolve that issue
by interconnecting multiple antenna elements, so that a sophisticated superposition
of the processed signals allows for an adjustable radiation characteristic known as
beamforming or spatial filtering. A PAA applies phase shifts to the signals of the
individual antenna elements. For example, in a Uniform Linear Array (ULA), $N$ antenna
elements are uniformly spaced with some distance $d$ (mostly half a wavelength, so
$d = \lambda/2$). To create a beam directivity that points towards a direction $\theta$, phase shifts of $0$,
$\varphi$, $2\varphi$ to $(N-1)\varphi$ are applied to the respective antenna elements $0$, $1$, $2$ to $N-1$ with
$\varphi = \frac{2\pi}{\lambda} d \cos\theta$ [39, Chapter 6]. Put simply, the number of elements $N$ determines the
beamwidth and antenna gain. In general, a larger $N$ leads to a higher gain and a more
focused beam.
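
The progressive phase shifts of a ULA and their effect on the array response can be sketched as follows. The sketch assumes ideal isotropic elements, measures the angle from the array axis, and uses the sign convention that the applied shifts cancel the path-length differences in the steering direction; these choices are illustrative assumptions.

```python
import cmath
import math


def ula_phase_shifts(n_elements, theta_deg, spacing_wavelengths=0.5):
    """Phase shifts 0, phi, 2*phi, ... with phi = (2*pi/lambda) * d * cos(theta)."""
    phi = 2 * math.pi * spacing_wavelengths * math.cos(math.radians(theta_deg))
    return [n * phi for n in range(n_elements)]


def array_response(shifts, theta_deg, spacing_wavelengths=0.5):
    """Magnitude of the summed element signals observed from direction theta."""
    path = 2 * math.pi * spacing_wavelengths * math.cos(math.radians(theta_deg))
    return abs(sum(cmath.exp(1j * (shift - n * path))
                   for n, shift in enumerate(shifts)))


shifts = ula_phase_shifts(n_elements=8, theta_deg=60.0)
gain_on_beam = array_response(shifts, 60.0)   # all elements add up in phase
gain_off_beam = array_response(shifts, 90.0)  # a direction away from the beam
```

In the steering direction the contributions of all $N$ elements add coherently, so the response magnitude equals $N$, which illustrates why a larger array yields a higher gain.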

This means that a steerable directivity is feasible and can be achieved electrically or 
by software. The transmitter antenna’s beam can be dynamically aligned to a receiving 
antenna and vice versa facilitating radio propagation by high transmit-and-receive 
antenna gains. However, radio signaling as part of the control plane of the Radio Access 
Network (RAN) needs to carry out this alignment task in a timely manner, which could 
be challenging due to the volatile radio conditions and the users’ mobility. 

For example, the alignment could be performed by means of a potentially time-consuming
discovery procedure such as beam sweeping. The coverable angular space
is iteratively sampled by switching the beam through different pointing directions. In
doing so, the beamwidth constitutes a trade-off between a higher gain and a reduced
number of iterations required to sample the complete angular space. While a precise
beam alignment is generally feasible in the analog domain, a number of quantized
main lobe/beam-pointing directions as large as the number of antenna elements is
often used to span an angular grid, where the selectable beams have the least possible
overlap [148, Chapter 6].

Fig. 5.50: Exemplary heatmap illustration of signal quality measurement during an exhaustive
search (beam sweep). The signal quality is given as Error Vector Magnitude (EVM) with lower values
representing better signal qualities. The red area represents the beam-pointing directions with a
suited signal quality. ©[2020] IEEE. Reprinted, with permission, from [268].

Although the exhaustive sweep procedure can be accelerated by using multiple
beams in parallel, each beam requires its own RF chain, which is expensive in terms
of cost and energy demand. For this reason, it is believed that analog
or hybrid beamforming, where only a small number of parallel beams is available, is
applicable for mmWave communications.

During a sweep, the measurements of the signal quality can be interpreted as a 
heatmap. Figure 5.50 depicts such a heatmap, with the most-suitable directions rep- 
resented by the red spots. In a mobile network like 5G, the base station continuously 
transmits some reference signals at different beam directions in the downlink, so the 
User Equipment (UE) is able to select the strongest one, while performing a sweep with 
its receiving beam. Since such systems are defined for Time Division Duplex (TDD), 
channel reciprocity can be assumed and the UE can use the determined beam configura- 
tion for initially accessing the network and reporting back the suited beam direction pair. 
These directions can subsequently be used for further transmissions/receptions until 
the mobile device has moved or some obstruction occurs, which means the measured 
beam-dependent signal quality becomes outdated.
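
An exhaustive sweep over transmit and receive beams can be sketched as a search over a measurement map. The measurement function below is a toy model with an arbitrary optimum; in a real system the values would come from reference-signal measurements. The beam grid and all names are assumptions.

```python
def exhaustive_sweep(tx_beams, rx_beams, measure_evm):
    """Measure every beam pair and return the pair with the lowest EVM."""
    evm_map = {(tx, rx): measure_evm(tx, rx)
               for tx in tx_beams for rx in rx_beams}
    best_pair = min(evm_map, key=evm_map.get)  # lower EVM means better quality
    return best_pair, evm_map


def toy_evm(tx_deg, rx_deg):
    """Made-up EVM in percent with its optimum at tx = 20 deg, rx = -10 deg."""
    return 5.0 + 0.5 * abs(tx_deg - 20) + 0.5 * abs(rx_deg + 10)


beams = list(range(-40, 41, 10))  # nine quantized pointing directions per side
best_pair, evm_map = exhaustive_sweep(beams, beams, toy_evm)
n_measurements = len(evm_map)
```

The number of measurements grows with the product of the beam counts on both sides, which is why the exhaustive search becomes time-consuming for narrow beams.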

5.5.2 Beam Tracking for Interruption-Free High-Performance Communications to
Mobile Devices


A proper beam alignment is as crucial to establishing a communication link as it is 
to maintaining it by tracking the mobile UE. The main drawback of the exhaustive 
search is its large search space and exploration time. Also, during this exploration, 
the beam points in various directions with weak signal quality, which may lead to 
a heavily reduced radio link performance or even a connection loss. For this reason, 
other procedures take into account position or previous direction information and
potentially the device mobility to facilitate an interruption-free utilization of the radio
resources for purposeful data transmissions. This means that once a proper alignment
is initially discovered, beam tracking is preferably applied to follow the device
movement. Only in case of a connection loss due to, say, sudden blockage, another 
comprehensive exploration might be required for radio link recovery. 

In our works [269, 270], we analyze the applicability of beam tracking for supplying 
mobile users with mmWave radio links. 

As a proof of concept, the position of a mid-flight drone/Unmanned Aerial Vehicle
(UAV) is recorded by an optical reference system allowing for a geometry-based, precise
calculation of the required beam-pointing direction. Figure 5.51 gives an overview
of the experimental setup. The UAV movement describes an arc at a fixed distance of 
1.8 m from the stationary active antenna/PAA. The central experiment logic controls 
this movement, processes the UAV position, sends corresponding beam-pointing com- 
mands and logs the measured performance indicators such as signal quality and data 
rate. In addition to the PAA’s beam alignment, the passive horn antenna at the UAV 
can be aligned horizontally by means of the UAV’s yaw rotation. 
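
The geometry-based calculation of the beam-pointing direction can be sketched as a conversion from Cartesian positions to azimuth and elevation angles. The coordinate frame (antenna boresight along the positive x-axis, z pointing up) and the example position are assumptions; the frame used in the actual setup is not specified here.

```python
import math


def pointing_direction(antenna_pos, target_pos):
    """Azimuth and elevation in degrees from the antenna towards the target."""
    dx, dy, dz = (t - a for a, t in zip(antenna_pos, target_pos))
    azimuth = math.degrees(math.atan2(dy, dx))
    elevation = math.degrees(math.atan2(dz, math.hypot(dx, dy)))
    return azimuth, elevation


# A target on an arc of radius 1.8 m, 30 degrees off boresight, at antenna height.
radius = 1.8
target = (radius * math.cos(math.radians(30.0)),
          radius * math.sin(math.radians(30.0)),
          0.0)
azimuth, elevation = pointing_direction((0.0, 0.0, 0.0), target)
```

Because the target moves along an arc at a fixed distance, only the angles change over time, so the free-space path loss stays constant in such an experiment.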

The evaluation results are condensed in the time-series graphs of Figure 5.52. When 
only the yaw rotation of the UAV is used to align the passive antenna at the UAV, the 
communication link is active only within a small range around the center direction, 
which is where the PAA is configured to point in the static case. In contrast, when
only the PAA’s pencil beam is continuously aligned towards the UAV, the misalignment
of the horn still leads to connection losses, although the wider beamwidth of the horn
makes it more tolerant of misalignment. Finally, when both transmitter and receiver
antenna are continuously aligned to each other, a stable link is observed in terms of a 
constantly high data rate of about 2.8 Gbit/s. This proves the general applicability of 
mmWave communications utilizing PAAs for beam alignment in scenarios with mobile 
users. 

Fig. 5.51: Experimental setup for real-time mmWave beam alignment studies with a flying UAV based
on [269]. The motion capture system provides position information of the UAV, which is processed to
a beam-pointing direction and sent to the PAA as a control command.

Fig. 5.52: Evaluation of mmWave beam alignment with a mid-flight UAV based on [269]. While
connection losses occur without updating the directivity, a seamlessly high data rate can be achieved by
tracking the movement of the mid-flight UAV with both transmitter and receiver antennas.

Since external position knowledge might not always be available and could require
an additional, beam-alignment-independent control link (for example at a conventional
sub-6 GHz band) for reliable reporting, a self-contained beam tracking approach based
on signal quality measurements is evaluated in [270]. Besides keeping the beam at a
direction with a still acceptable signal quality, better beam-pointing directions need to
be explored during a temporary impairment of the signal quality. In general, there is a
trade-off between the detail of exploration and the perceived signal quality by leveraging
the aligned beam’s gain, since the beam needs to be intentionally misaligned to
explore the device’s moving direction. Assuming that the device motion tracking under
Line-Of-Sight (LOS) conditions requires only gradual changes in the beam-pointing
direction, the search space can be substantially limited to the adjacent directions. For
example, in [270], a 3 × 3 grid in the azimuth and elevation plane of the angular space of
beam directions is spanned, centered at the last acceptable signal-quality direction. In
doing so, every scanning cycle consists of as few as nine signal quality measurements
and the subsequent search grid is centered on the direction with the highest signal
quality. Although this appears to misalign beams in most cases, the small number
of measurements per cycle allows for a low grid spacing below the beamwidth as long
as the sample rate is significantly higher than the device’s relative angular velocity.
With this, the connection can still be maintained during the exploration. The minor
reduction in antenna gain due to the slight misalignment can be compensated for by the
communication system. For an experimental evaluation of this approach, the device
motion is emulated in a reproducible fashion with a precise reference by using a rail
system.
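
One cycle of such a 3 × 3 grid search, and the effect of the grid spacing, can be sketched as follows. The signal quality model (angular distance to the true direction), the drift of 2° per cycle, and the function names are toy assumptions and do not reproduce the measurements in [270].

```python
import math


def tracking_step(center, spacing_deg, measure_evm):
    """Probe a 3 x 3 grid around `center` and return the best direction."""
    az0, el0 = center
    candidates = [(az0 + i * spacing_deg, el0 + j * spacing_deg)
                  for i in (-1, 0, 1) for j in (-1, 0, 1)]
    return min(candidates, key=lambda direction: measure_evm(*direction))


def simulate_tracking(spacing_deg, drift_deg_per_cycle, cycles):
    """Follow a device drifting in azimuth; return the azimuth error per cycle."""
    center = (0.0, 0.0)
    errors = []
    for cycle in range(1, cycles + 1):
        true_az = drift_deg_per_cycle * cycle
        # Toy quality metric: angular distance to the true direction (lower is better).
        toy_evm = lambda az, el: math.hypot(az - true_az, el)
        center = tracking_step(center, spacing_deg, toy_evm)
        errors.append(abs(center[0] - true_az))
    return errors


errors_3deg = simulate_tracking(spacing_deg=3.0, drift_deg_per_cycle=2.0, cycles=10)
errors_1deg = simulate_tracking(spacing_deg=1.0, drift_deg_per_cycle=2.0, cycles=10)
```

In this toy setting a spacing smaller than the drift per cycle cannot keep up, so the pointing error accumulates, whereas a spacing larger than the drift keeps the error bounded.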

The statistical results of this empirical analysis are depicted as violin plots in
Figure 5.53. While the signal quality is represented by the Error Vector Magnitude (EVM),
where a lower value corresponds to a higher signal quality, the link performance is
evaluated in terms of data rate. The emulated mobile device velocity is converted to
the related maximal tracking dynamics from the antenna’s perspective. A small
exploration grid spacing of Δ = 1° reduces the decline of the antenna gain due to a reduced
misalignment only at low relative velocities, since this step size is not sufficient to keep
track of the motion at higher dynamics. A larger spacing of Δ = 5° deteriorates the link
performance or may even lead to connection losses due to severe beam misalignments
during exploration. Hence, the grid spacing needs to be fitted to both the device’s
velocity and the antenna’s beamwidth. In the conducted test setup, a grid spacing of 3°
empirically turned out to be a reasonable trade-off. The results of the laboratory evaluation
thus show that a high-performance communication link can be maintained even for
considerable device velocities.

With respect to the utilization of the novel radio resources at the mmWave domain, 
this beam tracking approach allows for efficient utilization of the spectrum by reducing 
the link outage due to lengthy exploration phases. In addition, the directional trans- 
missions via pencil beams facilitate dense spatial reusability of these resources, since 
the interference within the mobile network is reduced. 

As an outlook, future beam tracking techniques may incorporate reinforcement
learning approaches to solve the exploration-exploitation trade-off dilemma between a
comprehensive exploration of beam-pointing directions with their associated signal
qualities and a perfectly aligned beam with ideal signal quality conditions for data
transmissions. In doing so, dynamic and reactive adjustments of the search space
(grid shape and spacing, for example) are conceivable according to the anticipated
movement of the device.

Fig. 5.53: Statistical beam-tracking evaluation. For exploring grid spacings Δ of 1°, 3° and 5°, the
signal quality in terms of Error Vector Magnitude (EVM) and the link performance in terms of data
rate are analyzed under different tracking dynamics. The tracking dynamics correspond to different
mobile device velocities that are reproducibly emulated by a rail system. The constantly low EVM
values at Δ = 3° constitute a reasonable configuration with a stable link performance even at the
highest tracking dynamics. ©[2019] IEEE. Reprinted, with permission, from [270].


5.5.3 Dual-use of Beam Alignment Information for Positioning of Mobile Devices 


Although a proper beam alignment embodies a new challenge to mobile networks, once
gathered, the direction information could also be used for a bearing-based positioning
service, as addressed in [268]. Conventionally, cellular network-based positioning
utilizes signal-strength measurements in conjunction with propagation loss models,
signal travel time, or propagation delay measurements (such as those used for the timing
advance mechanism) for distance-based positioning or lateration. With the necessity
for directional transmissions, angle-based methods utilizing direction information as a
bearing are conceivable at the mmWave domain. By means of two or more intersecting
bearings and known base station positions, a user-position estimate can be provided on
top of the ongoing wireless communication. In doing so, the accuracy strongly depends
on the distance or constellation and the resolution of the direction finding.

Fig. 5.54: A cross-bearing-based positioning utilizing mmWave beam alignment information. With
known base station positions and beam directions as bearings, the mobile device position can be
estimated as the intersection of the bearings. ©[2020] IEEE. Reprinted, with permission, from [268].

Figure 5.54 illustrates the basic concept of this approach, which originates from 
the cross-bearing technique of maritime navigation. The estimated position $\hat{r}_{\mathrm{target}}$ 
is defined as the position vector that minimizes the sum of squared distances to the 
(two or more) lines, each spanned by a base station position vector $r_i$ and its 
beam-pointing direction $d_i$ as unit direction vector. As a result, the position estimate 
is given as the least squares approximation (with $I$ as identity matrix): 


\[
\hat{r}_{\mathrm{target}} = \Bigl( \sum_i \bigl( I - d_i d_i^{T} \bigr) \Bigr)^{-1} \Bigl( \sum_i \bigl( I - d_i d_i^{T} \bigr)\, r_i \Bigr)
\]
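
The least squares estimate can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the implementation used in [268]: the function name `cross_bearing_position`, the example base station coordinates, and the target position are hypothetical, and the directions are normalized inside the function because the closed-form solution assumes unit direction vectors.

```python
import numpy as np


def cross_bearing_position(bs_positions, beam_directions):
    """Least squares intersection of bearing lines.

    bs_positions:    (N, dim) array of base station position vectors r_i
    beam_directions: (N, dim) array of beam-pointing directions d_i
    """
    r = np.asarray(bs_positions, dtype=float)
    d = np.asarray(beam_directions, dtype=float)
    # The closed-form solution requires unit direction vectors.
    d = d / np.linalg.norm(d, axis=1, keepdims=True)

    dim = r.shape[1]
    lhs = np.zeros((dim, dim))  # sum_i (I - d_i d_i^T)
    rhs = np.zeros(dim)         # sum_i (I - d_i d_i^T) r_i
    for r_i, d_i in zip(r, d):
        proj = np.eye(dim) - np.outer(d_i, d_i)
        lhs += proj
        rhs += proj @ r_i
    # Solving the linear system avoids forming the inverse explicitly.
    return np.linalg.solve(lhs, rhs)


# Hypothetical setup: three base stations with error-free bearings.
target = np.array([5.0, 5.0, 1.0])
stations = np.array([[0.0, 0.0, 3.0], [10.0, 0.0, 3.0], [5.0, 12.0, 3.0]])
directions = target - stations
estimate = cross_bearing_position(stations, directions)
```

With error-free bearings the lines intersect in a single point, so `estimate` coincides with `target`. With noisy bearings the same call returns the point that is closest to all lines in the least squares sense.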


From an exhaustive sweep, the direction vector $d_i$ is estimated based on the beam-pointing 
direction with the highest signal quality. Since the area with reasonable signal 
quality turns out to be rather flat but noisy, this estimate is rather imprecise. For this 
reason, the centroid of the red region, which contains the highest signal quality, is taken 
as the improved direction estimate. In addition to the geometric location and orientation 
of the base station antennas, systematic deviations between the commanded and the 
actual pointing direction of the antenna beams are compensated. The experimental 
evaluation indicates the applicability of this approach and is depicted in Figure 5.55. In 
general, within the laboratory setup, a 3D Euclidean positioning error in the centimeter 
range is observed. The post-processing compensation for systematic deviations further 

Fig. 5.55: Statistical evaluation of the positioning performance based on laboratory experiments. The 
3D Euclidean positioning error lies in the centimeter range and can be further reduced by compensating 
systematic deviations in the direction estimates towards the UE. ©[2020] IEEE. Reprinted, with 
permission, from [268]. 


improves the estimated position and thus illustrates the potential of bearing-based 
mmWave positioning. Further details about the experiments can be found in [268]. 
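
The centroid-based direction estimate mentioned above can be illustrated as follows. This is only a sketch: the source does not specify how the region of highest signal quality is delimited, so the `margin` threshold below the maximum, the function name `centroid_direction`, and the synthetic sweep data are assumptions made for illustration. Angles are averaged directly, which is only reasonable for a sweep range that does not wrap around.

```python
import numpy as np


def centroid_direction(azimuths, elevations, quality, margin):
    """Compare the argmax direction with the centroid of the best region.

    quality[i, j] is the signal quality measured at (azimuths[i], elevations[j]).
    The region of highest quality is assumed to be every grid point within
    `margin` of the maximum (hypothetical criterion).
    """
    az, el = np.meshgrid(azimuths, elevations, indexing="ij")
    q = np.asarray(quality, dtype=float)
    best = np.unravel_index(np.argmax(q), q.shape)
    region = q >= q.max() - margin
    centroid = (float(az[region].mean()), float(el[region].mean()))
    return (int(az[best]), int(el[best])), centroid


# Synthetic sweep: a flat plateau around (2°, -1°) with one noisy outlier.
azimuths = np.arange(-10, 11)
elevations = np.arange(-10, 11)
az, el = np.meshgrid(azimuths, elevations, indexing="ij")
quality = np.where((az - 2) ** 2 + (el + 1) ** 2 <= 9, 1.0, 0.0)
i_noise = azimuths.tolist().index(4)
j_noise = elevations.tolist().index(-1)
quality[i_noise, j_noise] += 0.05  # noise shifts the maximum off-center

argmax_dir, centroid_dir = centroid_direction(azimuths, elevations, quality, margin=0.1)
```

In this synthetic case the maximum is pulled to the noisy grid point, whereas the centroid of the plateau recovers its center, which mirrors the motivation given in the text.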

The direction exploration as well as the post-processing compensation may be 
subject to machine learning-based optimization techniques introducing automated 
trade-off decisions between the resource utilization and positioning precision during 
runtime. 

Furthermore, hybrid procedures could combine distance measurements and bearings 
for a further enhanced positioning service of mobile networks. The large available 
bandwidth in the mmWave domain could be attractive for pseudoranging or Time Difference 
of Arrival (TDOA) considerations. The application of TDOA-based positioning 
utilizing the Ultra-Wideband (UWB) technology is analyzed in more detail in Section 3.5. 

Due to the challenging propagation characteristics in the mmWave domain, a dense 
deployment of mmWave base stations is required and may lead to an enhanced system 
performance by utilizing approaches such as Coordinated MultiPoint (CoMP) or Dual 
Connectivity (DC), i.e., the connection of one UE with multiple base stations at a time. 
Within a mobile network, the proposed positioning mechanism can be applied in both 
the downlink and the uplink direction. While the Angle of Departure (AoD) of the 
base station downlink beams could be signaled to the UE together with a map of base 
station positions to perform the positioning at the UE, the Angle of Arrival (AoA) of the 
base station uplink beams could be utilized to perform the task on the network side. 
While the former provides reportedly sensitive information about base station positions 
to the UE, the latter alternative requires an active transmission by the UE, but no additional 
utilization of radio resources or signaling overhead, leading to a resource-efficient 
positioning solution. In both cases, a direction estimate that is as accurate as possible is 
required for positioning, which might be feasible only in case of a high-resolution sampling 
of the angular exploration space. Additionally, a Dilution of Precision (DOP) can be 
observed at acute angles between the intersecting bearings, so a careful placement 
of base station antennas might be advantageous. 
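
The effect of the crossing angle can be made tangible with a small 2D experiment. The following sketch is not taken from [268]; the geometry (two base stations at 10 m distance from the target), the bearing error of 1°, and all function names are assumptions chosen for illustration. One bearing is perturbed by the same angular error in both geometries, and the resulting position error is compared.

```python
import numpy as np


def intersect_bearings_2d(positions, angles_rad):
    """Least squares intersection of 2D bearing lines."""
    lhs = np.zeros((2, 2))
    rhs = np.zeros(2)
    for p, ang in zip(positions, angles_rad):
        d = np.array([np.cos(ang), np.sin(ang)])
        proj = np.eye(2) - np.outer(d, d)
        lhs += proj
        rhs += proj @ p
    return np.linalg.solve(lhs, rhs)


def position_error(positions, target, bearing_error_rad):
    """Position error when only the first bearing is off by bearing_error_rad."""
    angles = [np.arctan2(target[1] - p[1], target[0] - p[0]) for p in positions]
    angles[0] += bearing_error_rad
    estimate = intersect_bearings_2d(positions, angles)
    return float(np.linalg.norm(estimate - target))


def stations_with_crossing_angle(angle_deg, distance=10.0):
    """Two stations at equal distance whose bearings cross at angle_deg."""
    phi = np.radians(angle_deg)
    return np.array([[-distance, 0.0],
                     [-distance * np.cos(phi), -distance * np.sin(phi)]])


target = np.zeros(2)
bearing_error = np.radians(1.0)  # assumed direction-finding error
error_wide = position_error(stations_with_crossing_angle(90.0), target, bearing_error)
error_acute = position_error(stations_with_crossing_angle(10.0), target, bearing_error)
```

Although the distance to the target and the angular error are identical, `error_acute` is several times larger than `error_wide`, because a small rotation of one bearing moves the intersection point far along the other, nearly parallel bearing.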


5.5.4 Integration of High Priority mmWave Links into an End-to-End System 
Architecture 


As part of 5G, the mmWave spectrum contributes to the available resources at the Radio 
Access Network (RAN). End-to-end applications between users (humans or machines) 
and services come with various requirements, which differ greatly from each other. 
Within an end-to-end system architecture, a mobile network needs to be agile and utilize 
the available resources at both the core network and the RAN so that the application 
requirements can be fulfilled. As already elaborated in Section 5.4, network slicing is 
introduced to define virtual networks with certain configurations regarding throughput, 
latency, reliability, and others. Based on this, each application is assigned to a specific 
slice that not only supplies the required performance, but also remains unaffected 
by traffic fluctuations or shortcomings of other slices of the same mobile network. To 
illustrate the potential of this slicing, our work [266] presents a system concept and an 
experimental evaluation of an unaffected and prioritized communication link among 
other best-effort traffic. 

The overall system architecture design is depicted in Figure 5.56. To ensure the Quality 
of Service (QoS) requirements, the proposed Software-Defined Networking (SDN) 
Management and Network Orchestration (MANO) controller affects both RAN and core 
network. At the RAN, multi-RAT base stations are capable of directing the data traffic 
flow to and from UEs through different air interfaces according to the guidelines 
from the SDN MANO controller. For example, a conventional LTE link can be used 
in parallel with a 5G New Radio (NR) link in the mmWave domain at different Distributed 
Units (DUs) of the same base station. At the same time, the base station Central 
Unit (CU) is connected to the core network, where Virtual Network Functions (VNFs) 
dynamically allocate resources as required to operate the appropriate services. Overall, 
this design ensures end-to-end QoS within the whole mobile network. An experimental 
proof-of-concept study can be found in [266], where the Software-Defined Radio (SDR) 
and SDN-based components of the experimental setup allow for high flexibility and 
adaptability. 


5.5.5 Intelligent Reflectors for Enhanced Propagation and Coverage under 
Non-Line-of-Sight Conditions 


In addition to the discussed propagation loss and the need for directional transmissions, 
mmWave signals barely penetrate materials. As a consequence, the outdoor-to-indoor 
coverage is rather poor and obstructed areas need to rely on the presence of suitable 
reflection paths. These reflection paths in turn are volatile and need to be explored by 
means of a potentially time-consuming discovery procedure such as the aforementioned 
beam sweeping. The beam management needs to provide routines to recover from link 
blockages and to switch between propagation paths whenever the LOS condition 
varies. Taking the NLOS propagation into account, several challenges arise regarding 
the mobility support, which is doubtless a crucial feature of mobile radio networks. 
Especially in dense urban scenarios, frequent LOS obstructions may demand 
sophisticated procedures to facilitate radio links via reflection paths. 

With the novel concept of smart radio environments and the Reconfigurable Intelligent 
Surface (RIS) technology, the radio channel itself becomes modifiable to enhance 
the transmission performance. While much research concentrates on the optimization 
of transmitter and receiver techniques, the idea of this concept is to deploy elements 
(surfaces) with controllable reflection characteristics in the environment. Hence, these RISs 
act as dynamically controllable passive reflectors. In this way, they enable the purposeful 
utilization and adjustment of reflection paths, allowing for an enhanced tracking 
capability of user devices with an obstructed line-of-sight to the base station [46]. 

In our work [267], we highlight the potential of RISs for an enhanced mmWave network 
coverage in an urban campus scenario, as illustrated in Figure 5.57. The simulation 
model is based on our previous work [631] and extended to also account for RISs. 

Fig. 5.57: Simulation scenario for a RIS-enhanced coverage study. With LOS coverage marked in dark 
green, colors according to the RISs are used for the road sections with NLOS conditions, which are 
covered by the respective RIS reflected paths. ©[2020] IEEE. Reprinted, with permission, from [267]. 


As depicted in Figure 5.58, the base station deployment alone leads to a poor LOS coverage: 
the corresponding path loss lies within the expected link budget of 142 dBm in only 
23 % of the cases. However, the utilization of RIS reflection paths enhances the overall 
network coverage. For the alignment of the RIS reflection, distorted information about 
the target location is assumed: the true UE position is superimposed by zero-mean 
normally distributed noise with standard deviation σ. The Empirical Cumulative Distribution 
Function (ECDF) of the path loss illustrates that a coverage of 91 % is achievable in the 
case of σ = 3 m, for example. Even a comprehensive coverage is feasible due to the RIS 
placement in this evaluated scenario. 

As a result, the deployment of RISs for smart radio environments may not only 
lead to an enhanced network coverage; it also allows for an improved efficiency in 
terms of energy and spectral resources, since the controlled reflections may reduce 
the exploration overhead of beam management algorithms as well as the required 
transmit power for sufficient signal strengths at the receiver. Nevertheless, an elaborate 
control of the RISs may require some radio resources for both measuring and signaling 
suitable reflection paths. Finally, the RIS placement task may play a part in the network 
planning procedure and could be realized by means of machine learning approaches 
focusing on a cost-efficient way of providing a comprehensive network coverage. 

Fig. 5.58: Simulation results of RIS-enabled communication via reflection paths. Even with a slightly 
distorted target position information (zero-mean normal distribution with standard deviation σ), a 
comprehensive coverage of 91 % is achievable for σ = 3 m, while the pure LOS coverage amounts to 
only 23 % in this scenario. ©[2020] IEEE. Reprinted, with permission, from [267]. 


5.5.6 Towards Perceptive mmWave Networks by Channel Sensing 


Over the past decade, it has been shown that radar and communication functionalities 
may be provided by Joint Communication and Radar/Radio Sensing (JCAS) systems, 
because the employed OFDM waveform of current 4G/5G networks and WLANs is also 
suitable for radar services [657]. Nowadays, a deep integration of radar-like sensing 
services is expected for 6G, thus allowing communication networks and their entities 
to become perceptive of the immediate surroundings. Such information may then be 
used to optimize the network performance by, say, supplying mmWave beam sweeping 
and tracking algorithms with user position and mobility information. Such information 
may also be used in the sub-6 GHz band, e.g. to assist handover decision making. 
Radar systems are capable of detecting targets by analyzing the reflected waves of 
their own transmit signals over time. Through the use of large bandwidths and sweeping 
of highly directional antennas, it is possible to estimate distance, velocity, and angle 
information of the detected targets with high accuracy. Typical radar systems operate at 
very high frequencies, for example mmWaves or beyond, where multipath-based distortions 
are mitigated such that the high resolution due to bandwidth and directionality 
comes to fruition. Moreover, the imaging of the surroundings is also possible with radar 
technology. Considering the particular compatibility between radar requirements and 
mmWave communications, this is an opportunity for network operators to offer new 

Fig. 5.59: (a) Scalability analysis of a subsidence process (gradual sinking of up to 5 mm) affecting 
a suburban house. (b) Distribution of mean incurred error throughout Az-traversal for various UE 
mounting setups. ©[2021] IEEE. Reprinted, with permission, from [253]. 


services to the public, such as sensing-assisted traffic, but it also offers the prospect of 
process optimization in industrial facilities employing private network solutions. 

However, the integration of radar functionality into mmWave communications still 
has a long way to go. For example, there is a need for hardware and signal processing 
enhancements. Nonetheless, radio-based sensing features such as user positioning 
have been available for more than two decades and have been steadily enhanced ever since. 
Such sensing is enabled by analyzing the properties of one or more channels between 
the network and the user equipment entities. A large number of channel-based services, 
such as vehicle detection and classification (cf. Section 4.2), have already been proposed 
in the literature [753]. These have been designed predominantly for sub-6 GHz WLANs, yet 
they led to the recent launch of the IEEE 802.11bf Wi-Fi Sensing standardization, which even 
pertains to the mmWave domain. 

With the inclusion of mmWave frequencies into the 5G standard, 5G positioning 
was successively adapted to allow the utilization of mmWave beam information, which, 
for example, enables angle-based positioning, thus enhancing the network’s location 
services by new methods (cf. Section 5.5.3). Our work [253] followed a similar approach 
and considered the use of pencil beam orientation information to enhance traditional 
channel phase tracking-based measurements of relative motion and vibrations. By combining 
the movement information of several UEs along the LOS path beam orientations, 
we showed that millimeter-range motions may be reconstructed in 3D space with less 
than 10 µm error. (See Figure 5.59 for the detailed results of a sample scenario.) Our full 
scalability analysis suggests that the usage of 4 to 5 distinct spatial link opportunities, 
which are expected in typical urban deployments between a single TX-RX pair, is a 
sensible choice. Therefore, high-accuracy 3D motion tracking could in the future be 
conducted with a single user device exploiting several distinct propagation paths to the 
network. Our ongoing work is evaluating the achievable beam orientation accuracy and 
the consequences of misalignment in the prior two contexts. Nonetheless, by showing 
that future sensing features may already be realized with current mmWave technology, 
we point out the need for more research in this area because such techniques may also 
allow mobile networks to become more perceptive of their surroundings. 
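
The idea of combining per-link motion information along known beam orientations can be sketched as a small least squares problem. The following is an illustrative simplification and not the method of [253]: it assumes that each link observes only the projection of a common 3D displacement onto its (unit) beam direction, that the phase is already unwrapped and refers to a one-way path, and that the sign convention of the phase matches the direction vectors. The carrier frequency of 28 GHz, the beam directions, and all names are hypothetical.

```python
import numpy as np

SPEED_OF_LIGHT = 299_792_458.0  # m/s


def radial_displacement(phase_change_rad, carrier_hz):
    """Convert an unwrapped one-way phase change into a path length change."""
    wavelength = SPEED_OF_LIGHT / carrier_hz
    return wavelength * np.asarray(phase_change_rad, dtype=float) / (2.0 * np.pi)


def reconstruct_motion(beam_directions, radial_displacements):
    """Least squares 3D displacement from its projections onto beam directions."""
    u = np.asarray(beam_directions, dtype=float)
    u = u / np.linalg.norm(u, axis=1, keepdims=True)
    motion, *_ = np.linalg.lstsq(
        u, np.asarray(radial_displacements, dtype=float), rcond=None
    )
    return motion


# Hypothetical scenario: four distinct link directions, millimeter-range motion.
carrier_hz = 28e9  # assumed mmWave carrier
true_motion = np.array([1.0e-3, -2.0e-3, 0.5e-3])  # meters
beams = np.array([[1.0, 0.0, 0.2],
                  [0.0, 1.0, 0.2],
                  [-1.0, 0.5, 0.3],
                  [0.3, -1.0, 0.5]])
unit_beams = beams / np.linalg.norm(beams, axis=1, keepdims=True)

# Simulated noise-free phase changes observed on each link.
phases = 2.0 * np.pi * (unit_beams @ true_motion) / (SPEED_OF_LIGHT / carrier_hz)
estimated_motion = reconstruct_motion(beams, radial_displacement(phases, carrier_hz))
```

At least three linearly independent beam directions are needed for a unique 3D solution; additional links, such as the 4 to 5 spatial link opportunities mentioned above, over-determine the system and average out measurement noise.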


5.5.7 Concluding Remarks 


In this section, the potential of mmWave communication has been elaborated with 
insights into several promising areas of research. The novel mmWave spectrum for 
mobile networks embodies a great opportunity for various future applications due to 
the vast amount of available spectral resources as well as the peculiar radio channel 
conditions. The demand for directional communication offers less interference and 
better spatial reuse of the spectral and time resources. According to the results of 
our presented works, the challenge of a proper beam alignment appears manageable 
and beam-based positioning can be provided as an additional feature. Also, there are 
concepts for integrating the novel spectrum into the overall mobile network capacity 
by means of network slicing, as shown from a systems perspective. Finally, the field 
of application will be further enhanced by the introduction of the novel concept of 
smart radio environments, where RISs support the propagation of mmWave beams 
and thus enhance the comprehensive network coverage in obstructed areas. Via an 
outlook on future perceptive networks, we showed that current networks could already 
partially enable novel sensing applications as expected for 6G. Therefore, research 
should further investigate and test sensing techniques that employ mmWave beam 
orientation information alongside the ongoing development of a 6G JCAS framework. 


5.5.8 Acknowledgments 


In addition to the CRC 876, part of this work has been supported by the Ministry 
of Economic Affairs, Innovation, Digitalization and Energy of the State of North Rhine- 
Westphalia (MWIDE NRW) along with the Competence Center 5G.NRW under grant 
number 005-01903-0047. 


6 Privacy 


6.1 Keynote: Construction of Inference-Proof Agent Interactions 


Joachim Biskup 


Abstract: To comply with the social issue of preserving privacy or pursuing other confi- 
dentiality requirements, we outline a broad range of conceptual solutions to a task of 
computing engineering: configuring the formal interactions of an individual’s information 
system agent with the client agent of a communication partner in an inference-proof 
manner. Here inference-proofness means the following. A security mechanism shielding 
the system agent under the individual’s control reduces the information content 
of the messages sent to the client agent such that the partner would not be able to learn 
any information to be kept confidential under the individual’s confidentiality concerns. 
This goal has to be provably guaranteed even if the communication partner as a rational 
reasoner will exploit not only a priori knowledge about the application underlying 
the communication acts but also additional background knowledge comprising both 
a complete specification of the interaction semantics and the full awareness of the 
security mechanism. 


6.1.1 Foreword: Intended Audience 


This contribution gathers, unifies, clarifies, and explains in depth the concepts and 
insights of a dedicated line of research and development within one of the basic subfields 
of IT-security, namely user-centric, self-determinative, and computer-supported 
enforcement of confidentiality interests, including the preservation of privacy at the 
discretion of the individuals involved. Our own contributions started around two decades 
ago (see the brief bibliographic notes in Section 6.1.7), initially inspecting the 
evaluation of sequences of closed queries by a database management system, such 
as a relational database system, the abstract semantics of which are based on a fragment 
of first-order logic. Understanding IT-security as a comprehensive problem of 
both organizational and computational issues, it has over time become more and 
more demanding to expand to further operations such as transaction management 
for mixed query and update operations by more expressive information processing 
systems. In the end, dealing with procedural program execution as a service of any 
kind of knowledge- and belief-management system would be the ultimate goal. 

The broader the range of particular operations by specific computing systems, 
in each case treated by appropriate highly sophisticated means, the more urgent the 
need to identify useful abstractions for the computational issues and to reconsider 
the supportive organizational issues becomes. Regarding the first point, we abstract 
the object of protection to be the epistemic state of an intelligent computing agent 
participating in a multi-agent computing configuration. Regarding the second point, 
we explicitly expand all the intuitive assumptions underlying the various computational 
protection efforts into a framework of eight basic features, to which then the formal 
notion of the kind of protection we want to mathematically verify refers. 

Experts in the fundamentals of IT-security might benefit from this article by learning 
the carefully elaborated essence of a large number of highly specialized publications. 
Computer scientists with a broader expertise in the field of confidentiality enforcement 
might be encouraged to generate a similar retrospective of their own line of research and, 
maybe, to fill some of the many gaps in the list of operations already treated. Researchers 
working on machine learning and embedded systems with a strong interest in security 
issues might also be triggered to fill those gaps for which they have appropriate expertise. 
Other researchers working on machine learning and embedded systems might gain 
detailed exemplary insights into the subtleties of integrating purely functional aspects 
with concise security considerations. They might further reflect on the related notions 
of (syntactic) data on the one hand and inferred (semantic) knowledge and belief on the 
other hand underlying their own work, and they might consider the design of an overall 
system architecture of their interest where the security measures are appropriately 
located. Finally, admitting that this article deals only with a possibilistic version of 
confidentiality, all kinds of readers might think about and contribute to generalizing the 
entire approach to probabilistic considerations. 


6.1.2 Confidentiality-Preservation and Inference-Proofness 


Since time immemorial, among many other activities, and in a closely intertwined man- 
ner, people have reasoned as individuals by acquiring, structuring, keeping, and ex- 
ploiting information to make up their respective minds and behaved as social creatures 
by communicating with others. With the advent of computing technologies, individually 
dealing with information and socially communicating have been partly delegated to 
computing agents. On the one hand, the delegation is meant to facilitate routine tasks 
or even enhance human capabilities. 

On the other hand, depending on the context, as delegators, individuals at their discretion 
or groups of them according to some socially accepted norm aim to still control 
the computing agents executing protocols as their delegatees, or at least the human 
delegators should appropriately configure the computing delegatees. 

Being aware of the resulting reduction, and somehow simplifying, we can map 
concepts of human reasoning and communication to the inference protocols and inter- 
action protocols of their computing agents and, correspondingly, actually performed 
human activities to protocol-complying computing process executions. Under such a 

Fig. 6.1: Two reasoning and socially communicating human individuals and their protocol-based 
interacting computing agents as part of a larger community and the corresponding multi-agent 
computing configuration. 


reduction, and even more simplifying, a group of human individuals is modeled to be 
complemented by a multi-agent computing configuration. In this model, each human 
individual controls a dedicated computing agent that, at least partly and by means of 
protocol executions, both deals with the information owned by that individual, in partic- 
ular by internally deriving an epistemic state from a chosen information representation, 
and mediates the communications of that individual, in particular by sending and 
receiving messages according to one or more agreed interaction protocols. Figure 6.1 
illustrates the sketched scenario. 

Though, in principle, each individual can act in diverse roles and, correspondingly, 
each controlled computing agent can execute diverse protocols, we further specialize 
the model sketched above in focusing on only two individuals together with the respec- 
tive computing agents. One individual is seen as an information owner controlling an 
information system agent, and the other individual is treated as a cooperating commu- 
nication partner employing a client agent. Moreover, to enable cooperation, in principle 
the information owner is willing to share information with the communication partner. 
However, complying with privacy issues or pursuing other confidentiality requirements, 
as an exception from sharing, the information owner might want to hide some specific 
pieces of information. 

Fig. 6.2: The framework of a defending information owner with his information system agent and an 
attacking communication partner with his client agent (showing suspected but inaccessible parts in 
blue). See Figure 1 on page 82 of [73], © IFIP/Springer 2020. 


Slightly more concisely, and visualized in Figure 6.2, we assume the following framework 
with eight features. 


1. [Epistemic state of the information system agent as a single object of protection.] 
The human information owner does not deal with information processing and 
reasoning by himself but only provides the inputs to the information system agent 
under his control. At each point in time, that agent is internally deriving a formally 
defined epistemic state. 

2. [Mediation of human communications by interacting computing agents.] 
Once having agreed on cooperation, the human information owner and his human 
communication partner do not communicate directly with each other, but only 
mediated by the computing agents under their respective control. 

3. [Dedicated access permissions for information sharing.] 
Independently of the actual epistemic state, the information owner has granted 
dedicated access permissions to his communication partner. These permissions 
declare that over time the client agent of the partner may interact with the 
information system agent of the owner following some explicitly chosen interaction 
protocols that exclusively refer to the internal epistemic state of the information 
system agent. 

4. [Exceptions by explicit prohibitions designating pieces of information.] 
Also independently of the actual epistemic state, the information owner has explicitly 
declared exceptions from the dedicated access permissions in the form of 
prohibitions. Each prohibition specifies a piece of information that the communication 
partner should not be able to learn. More precisely, with each prohibition 
expressed in terms of the information system agent and thus in reference to possible 
epistemic states, the communication partner should never be able to become sure 
about the actual validity in the epistemic state of the information system agent. In 
other words, from the partner’s point of view it should always appear to be possible 
that the prohibited piece of information is not valid in the epistemic state of the 
information system. 

5. [Partner suspected to reason about the validity of prohibitions.] 
Though the client agent is restricted to follow the interaction protocols of the access 
permissions exactly, the human communication partner can choose any sequence 
of permitted commands. Moreover, the partner is assumed to have unlimited 
computational resources when rationally reasoning about the validity or non-validity 
of a prohibited piece of information. 

6. [Security mechanism implanted in the owner’s information system agent.] 
To actually enforce the confidentiality requirements of the information owner, the 
information system agent is enhanced by some implanted security mechanism 
that should shield the underlying information processing from direct contact with 
the client agent. That security mechanism first inspects each message to be sent 
by the information system agent to the client agent according to the pertinent 
interaction protocol to determine whether a violation of the information owner’s 
confidentiality requirements would be enabled on the side of the communication partner. 
If this is the case, the security mechanism then alters the message such that the message 
is still as informative as possible but all options for a violation are blocked. 

7. [Reasoning supported by a priori knowledge and background knowledge.] 
First of all, the communication partner’s rational reasoning about the internal 
epistemic state of the information system agent is based on the messages exchanged 
by the respective computing agents. These messages are completely known to both 
agents. Additionally, the partner’s rational reasoning is presumed to be supported 
by some a priori knowledge about the application dealt with in the cooperation 
between the two individuals involved and additional background knowledge 
comprising both a complete specification of the interaction semantics and the full 
awareness of the security mechanism (possibly even including the prohibition 
declaration) and, most notably, nothing else. 

8. [Principle inaccessibility of the partner.] 
The internals of both the human communication partner and his client agent are 
considered to be principally inaccessible for the information owner and his system 
agent. This implies that the latter ones can only rely on assumptions about the 
details of the a priori knowledge and a postulation about the background knowledge 
available to the former ones. 


These features can still be formally instantiated in various ways. In all cases, we follow 
a martial-sounding but common terminology, which ignores that in many scenarios 
an individual involved as communication partner will primarily be treated as cooperating 
in a friendly manner, rather than as a “total enemy”. At least partially trusted 
for consciously sharing information in principle and correctly executing the agreed 
interaction protocols, the communication partner (together with the client agent 
controlled by him) is denoted as a semi-honest attacker, suspected of potentially aiming to 
maliciously infer the actual validity of pieces of information that the information owner 
has declared to be kept confidential. Accordingly, the information owner (together with 
the information system agent controlled by him) is denoted as the defender. 

Now, the security mechanism has to invariantly enforce a suitable version of the 
following security policy of (possibilistic) inference-proofness, which also specifies the 
attacker model: For each prohibited piece of information ψ, the information content of 
messages sent to the attacking client agent (which is possibly enhanced by reasoning 
capabilities supplied by the human communication partner) during executions of 
agreed interaction protocols will never enable the attacking receiver to rationally infer 
that ψ is valid in the epistemic state, even when 
- inspecting the complete history of preceding interactions, 
- considering some a priori knowledge about the possible epistemic states, 
- applying the semantics of the agreed interaction protocols, and 
- being aware of the functionality of the security mechanism. 


The concept of rationality on the side of the attacker is then captured by the following 
rephrasing of the still to be suitably versioned security policy of inference-proofness in 
terms of indistinguishability (as roughly visualized by Figure 6.3): 
For each prohibited piece of information ψ, 
for each epistemic state d satisfying the a priori knowledge, 
for each sequence of messages mes_1, ..., mes_k 
exchanged during an interaction history 
complying with the agreed interaction protocols 
but potentially altered by the security mechanism 
there exists an “alternative” epistemic state d′ such that 
1. the same sequence of messages would be generated, in compliance with the agreed 
interaction protocols and subjected to the alterations by the security mechanism, 
but 
2. ψ is not valid in d′. 

Fig. 6.3: A rough visualization of the requirement of inference-proofness: “for each ψ, for each d, for 
each mes_1, ..., mes_k, there exists d′ such that...”. 


For this rephrasing, the epistemic state d is thought of as actually derived by the
information system agent (for short, “stored”) and might satisfy the prohibited piece of
information ψ or not. The former case implies that the alternative state d′ required to
exist is different from d; in the latter case, the actually stored state d and the alternative
state d′ might be the same. Accordingly, declaring ψ as a prohibition does not intend to
block any option that enables the attacker to infer the non-validity of ψ.
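
The indistinguishability requirement can be checked by brute force in a small finite setting. The following is only a sketch under stated assumptions: epistemic states are truth assignments over two hypothetical atoms, queries and prohibitions are Python predicates on states, and the names `inference_proof`, `honest`, and `answer_first_only` are illustrative and not part of the framework itself.

```python
from itertools import product

ATOMS = ("a", "b")  # hypothetical propositional atoms


def all_states():
    """Enumerate every truth assignment over ATOMS as a dict."""
    for values in product([False, True], repeat=len(ATOMS)):
        yield dict(zip(ATOMS, values))


def inference_proof(respond, queries, prohibitions, a_priori=lambda d: True):
    """For each prohibition psi and each admissible state d, look for an
    admissible state d' that yields the same messages but does not satisfy psi."""
    admissible = [d for d in all_states() if a_priori(d)]
    for psi in prohibitions:
        for d in admissible:
            observed = respond(d, queries)
            witness = any(
                respond(alt, queries) == observed and not psi(alt)
                for alt in admissible
            )
            if not witness:
                return False
    return True


def ask_a(d):
    return d["a"]


def ask_b(d):
    return d["b"]


def secret(d):
    # Illustrative prohibition: "a and b" must not be inferable.
    return d["a"] and d["b"]


def honest(d, queries):
    """No security mechanism at all: every query is answered truthfully."""
    return [q(d) for q in queries]


def answer_first_only(d, queries):
    """Toy mechanism: answers the first query, refuses all later ones
    independently of the state."""
    return [queries[0](d)] + ["refused"] * (len(queries) - 1)


print(inference_proof(honest, [ask_a, ask_b], [secret]))             # False
print(inference_proof(answer_first_only, [ask_a, ask_b], [secret]))  # True
```

The honest mechanism fails because the state satisfying both atoms produces a message sequence that no other state produces. The toy refusing mechanism passes, at the price of reduced availability.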
Confidentiality as inference-proofness could be trivially achieved by granting no 
access permissions at all or altering the information content of all messages sent to 
the attacker to nothing, violating any conflicting availability requirements and shutting 
down any communication mediated by the respective computing agents and, thus, mak- 
ing the whole thing useless. Accordingly, confidentiality requirements and availability 
requirements always have to be suitably balanced. 
All our work focuses on the following three-level conflict resolution strategy: 
1. As a general rule, some dedicated access permissions are granted for the sake of
availability, to be freely enjoyed by the client agent, insofar as they do not conflict 
with level 2 of the strategy. 




2. As exceptions from the general rule, specific prohibitions are declared for the sake
of confidentiality that must never be violated and, thus, these prohibitions have to
be enforced by alterations made by the security mechanism, but to comply with
level 3 of the strategy only insofar as definitely necessary.

3. As a limitation for the effect of exceptions, the alterations made have to be minimal,
again for the sake of availability.


Given the access permissions on the first level, the second and the third level lead to a
combination of a constraint solving problem and an optimization problem.

Our simplified defender-attacker agent model still allows many instantiations and
versions. For our concrete ongoing research, and accordingly for this
article, we broadly distinguish the structures of three fundamental data types for an
information system, namely abstract data sources, propositional knowledge or belief
bases, and first-order relational databases.

For each of these data types, considering suitable refinements, we deal with the
pertinent operations, which in our case are the interactions with a client agent,
comprising in all cases at least
- closed-query/yes-no-query evaluation with response preparation,
  performed repeatedly with queries that in general are different;

and, depending on the refinement, additionally
- open-query evaluation with response set preparation,
  performed repeatedly with in general different queries;
- view generation,
  performed only once, since the attacker can freely employ a received view at his
  discretion instead of contacting the defender again;
- view updating,
  possibly performed from time to time, if manageable at all;
- knowledge update transaction,
  performed repeatedly, usually intertwined with queries;
- belief revision,
  performed repeatedly, usually intertwined with queries evaluated under
  non-monotonic reasoning;
- procedural program execution,
  performed repeatedly with in general different input parameters;
- data outsourcing,
  performed only once.


Tailored to the respective refinement, we propose and study alterations to the available 
interactions to ensure inference-proofness. There are two basic approaches to alter- 
ations, namely weakening the pertinent information about the actual epistemic state 
and lying about the pertinent information about the actual epistemic state, allowing or
even requiring appropriate refinements and combinations of weakening and lying. 


6.1.3 A Generic Construction Methodology for Alterations 


We can design and verify basic kinds of alterations according to the following generic 
construction methodology, which refers to the eight features of the framework de- 
scribed in Section 6.1.2. According to Feature 1, at each point in time, the defending 
information system agent is privately deriving its actual epistemic state. According to 
Feature 3, this state is then taken as the basis for the data contained in the messages 
to be sent to the attacking client agent during an interaction complying with some 
agreed protocol, for which some dedicated access permissions are granted. According 
to Feature 5, the attacker is suspected to aim at gaining as much information as possible 
about the defender’s actual epistemic state, in particular whether a prohibited piece 
of information is valid in that state. In general, however, the attacker will face some 
uncertainty about that state, since (i) by Feature 2, there are no direct communication 
acts between the human individuals involved and, (ii) by Feature 6, the attacker has 
been separated from the defender’s underlying information processing by the shield of 
the implanted security mechanism and the interactions are restricted to the exchange 
of messages. According to Feature 4, the attacker’s uncertainty should always include 
that any prohibited piece of information might be not valid in the defender’s actual 
epistemic state. 

Conceptually, the attacker’s uncertainty can be captured by the set of those epis- 
temic states that appear to be possible to him. According to Feature 7, an epistemic state 
qualifies to be possible if it is compatible with both the potentially altered messages 
observed so far and the already initially available a priori knowledge and background 
knowledge. All the qualifying epistemic states together form the least uncertainty left 
to the attacker, i.e., the best achievement to satisfy his suspected curiosity: 

- exactly one of the qualifying epistemic states is the actual state;
- all other qualifying epistemic states could possibly be the actual state as well;
- all non-qualifying epistemic states can definitely be excluded from being the actual
  one.


At the point of time t, we call the set bestcv_t of the then-qualifying epistemic states the
attacker’s best current view (on the defender’s actual epistemic state). In doing so we do
not care whether or not the attacker really achieves exactly this optimal result. However,
on the one hand, the kind of protection wanted by us is strongly based on the presumed
rationality of the attacker: he will definitely never fail to identify an epistemic state
as qualifying. On the other hand, he might be too lazy to exactly identify all actually
non-qualifying epistemic states. In other words, the attacker might work with either
the best current view bestcv_t or any superset of it. Under this condition, the security
policy of inference-proofness essentially requires that the best current view bestcv_t still
reflects sufficient uncertainty, namely that for all prohibited pieces of information ψ
there exists an epistemic state d′ ∈ bestcv_t such that ψ is not valid in d′.

Fig. 6.4: A rough visualization of a pertinent security invariant.

Fortunately, the defender’s security mechanism does not need to determine the
attacker’s best current view bestcv_t, and the mechanism would not be able to carry out
such a determination, due to the inaccessibility of the partner’s internals according to
Feature 8. Instead, it suffices to maintain an appropriate simulated current view simcv
that approximates the inaccessible behavior of the attacker and to enforce the following
security invariant for all points in time t, or sometimes even a stronger one (as roughly
visualized by Figure 6.4):
- simcv_t ⊆ bestcv_t, i.e., the defender approximates the attacker’s uncertainty from
  below, potentially underestimating but never overestimating the uncertainty;
- for all prohibited pieces of information ψ there still exists an epistemic state
  d′ ∈ simcv_t such that ψ is not valid in d′.
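
The two conditions of the security invariant can be written down directly when views are small explicit sets. This is a conceptual sketch only: in the setting described here the defender cannot access bestcv_t, so the subset test is something to be proved about a mechanism rather than computed at run time. The encoding of states as frozensets of valid atoms and the name `invariant_holds` are assumptions made for illustration.

```python
def invariant_holds(simcv, bestcv, prohibitions):
    """Check the two parts of the security invariant on explicit sets of states.

    simcv, bestcv: sets of states; prohibitions: predicates on states.
    """
    # Part 1: the simulated view approximates the best view from below.
    if not simcv <= bestcv:
        return False
    # Part 2: every prohibition has a counter-witness inside the simulated view.
    return all(any(not psi(d) for d in simcv) for psi in prohibitions)


# Illustrative encoding: a state is the frozenset of atoms valid in it.
s_none, s_a, s_ab = frozenset(), frozenset({"a"}), frozenset({"a", "b"})


def psi(d):
    # Illustrative prohibition: atom "a" is valid.
    return "a" in d


bestcv = {s_none, s_a, s_ab}
print(invariant_holds({s_none, s_a}, bestcv, [psi]))  # True
print(invariant_holds({s_a, s_ab}, bestcv, [psi]))    # False: no witness left
print(invariant_holds({s_none}, {s_a}, [psi]))        # False: not a subset
```

Because the witness d′ lies in simcv_t and simcv_t is contained in bestcv_t, the same witness also lies in any superset of bestcv_t that a less thorough attacker might work with.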


Moreover, in general the security mechanism will work with some concise (i.e., algo- 
rithmically treatable) representation rep(simcv) and algorithmically check the security 
invariant in terms of the pertinent representation. 

Now, such a simulation would then be initialized at the point in time t = 0 by setting
rep(simcv_0) according to a suitably concise representation of the (assumed) a priori
knowledge. This requires the natural security precondition that the a priori knowledge 
does not violate any prohibition. In fact we cannot prevent the attacker from “learning” 
what he is already sure of. 

Inductively, at the point in time t + 1, the simulation should contain a suitably
concise representation rep(simcv_t) of an appropriate set simcv_t of epistemic states. Then,
for the functionally correct messages to be sent to the client agent (and in some cases
for some suitably determined modifications as well) the security mechanism first has to
inspect the consequences of tentatively updating rep(simcv_t) accordingly, i.e., whether
or not the security invariant would be violated. In case of a violation, the security
mechanism then has to identify suitable alterations that definitely avoid the violation.
Finally, the possibly altered messages are actually sent out, and the simulation is
actually updated accordingly, now ensuring the security invariant. Figure 6.5 visualizes
the basic steps of these actions, though the form is not universally applicable.

Fig. 6.5: A rough visualization of the basic steps of inductively enforcing a security invariant.

So we are left with the most crucial points of our construction methodology: 

1. maintaining a convenient data structure for a “suitably concise representation
rep(simcv_t) of an appropriate set simcv_t of epistemic states”;

2. checking tentative updates for violations of the security invariant, preferably aiming 
at the approximately best computational complexity; and 

3. as far as required, efficiently identifying suitable alterations, preferably without the 
need for a further explicit violation check. 


Moreover, each specific situation dealt with might require some variations, in partic- 
ular regarding the points in time to be considered. Let us first consider situations for 
which the defender’s epistemic state is kept fixed over time. If the sequence of
interactions to be treated consists only of instantiations of the most basic interaction
(closed-query/yes-no-query evaluation with response preparation), then the successive
points in time are essentially determined by the defender receiving the next query,
which asks for the validity of a single piece of information in the defender’s fixed 
epistemic state. 

If the sequence of interactions also contains instantiations of the more advanced in- 
teraction of open-query evaluation with response-set preparation, then the response 
set can be formed by iteratively inspecting an internally generated sequence of closed 
queries, each of which is either a ground substitution of the open query (i.e., obtained 
by replacing free variables by constant symbols) or a specifically defined completeness 
sentence (dealing with “negative information”). Accordingly, for the point in time of 
receiving the open query, basically shared with the attacker, the defender privately 
determines a sequence of subpoints in time. Similarly, if the sequence of interactions 
also contains an instantiation of view generation for the fixed state, a response can 
be formed by internally inspecting a somehow exhaustive sequence of closed queries, 
also leading to a private sequence of subpoints in time. Alternatively, a view can be 
generated by inspecting suitably defined open queries, as explained above, leading to 
the respective private sequences of subpoints in time. 

In general, however, the epistemic state of the defender might change over time,
raising additional issues, especially when it comes to finding an inference-proof way 
for 
4. resolving conflicts between confidentiality and integrity, and 
5. ensuring backwards confidentiality. 


Basically, a confidentiality—integrity conflict might occur if the request for a change 
of the defender’s epistemic state originates from the attacker. In general, with regard 
to pure functionality, the information system agent has to perform a transaction that 
first tentatively changes the state as requested and then checks whether or not the new 
state complies with all semantic constraints that are declared to be maintained as an 
integrity invariant according to the underlying application and, thus, assumed to be 
part of the a priori knowledge; for a relational database such a declaration might be part 
of the database schema, but in all cases such a declaration might also be expressed 
externally. In case of compliance, the transaction is committed making the tentative 
change persistent; otherwise, in case of non-compliance, the transaction is aborted, 
recovering the previous state. In both cases the requester is notified accordingly. 

However, with regard to a security policy of inference-proofness, the respective 
functionally correct notification can be seen as a response set to one or more appropri- 
ately constructed closed queries. This response set might enable the attacker to learn 
the validity of some piece of information contained in the prohibition declaration, i.e., 
incorporating the response set to the (concise representation of the) simulated current 
view would violate the required security invariant. Consequently, the resulting conflict 
has to be resolved by either suitably modifying the transaction functionality or altering 
the response set leading to the actually returned notification, as already outlined for 
queries, or combining both activities. 
In doing so, we also have to account for an implicit information flow that is caused by
the overall control flow structure of a transaction as a guarded command in the form of 
an if-then-else branching: knowing the transaction semantics and being fully aware of 
the security mechanism, if the attacker can infer which branch has been selected, then 
he can also figure out whether and how the epistemic state has actually been changed 
and which alterations have actually been made to the functionally correct notification. 

More generally, the defender’s side of an interaction might have a more or less 
sophisticated overall control flow structure stemming from guarded commands like 
if-then-else branching, repeat-repetitions, while-repetitions, and similar procedural com- 
mands that can cause implicit information flows. Then the security mechanism typically 
inherits this potentially critical control flow structure. Moreover, the (code of the) se- 
curity mechanism itself might have a critical control flow structure. Accordingly, the 
security mechanism has to appropriately treat an attacker’s potential observation of 
a control path that has actually been chosen during a (hidden) execution on the
defender’s side and the resulting implicit information flow, similarly to the responses
to explicit queries. As outlined above for transactions, the treatment might include
interpreting such an observation as an implicit query.

The issue of backwards confidentiality results from the following observations about 
the consequences of an update of the epistemic state. First, previously released infor- 
mation about the validity of a prohibited piece of information might become outdated 
and, at least in general and whenever possible in an inference-proof way, the defender 
should suitably inform the attacking partner about the occurrence of an update and 
send him a pertinent refreshment of outdated information. Second, it can be shown that 
such a notification together with the refreshment and further information contained 
in messages sent at subsequent points in time might enable the attacker to infer the 
validity of a prohibited piece of information in the past at some preceding point in 
time. Accordingly, we have to strengthen the security policy of inference-proofness by 
requiring continuous inference-proofness to be enforced for the full range of all points 
in time that so far have happened, rather than just for the respective last one. This 
goal can actually be achieved by checking tentative updates of the representation of a 
simulated current view for stronger versions of the security invariant. 

Basically, the general construction methodology for alterations proceeds itera- 
tively, whether the points in time considered are externally determined by observably 
receiving/sending a message or only privately generated. However, for an interaction 
expected to be performed only once, in particular for view generation and data out- 
sourcing, there could be only one point of time of interest, and thus one might look for 
a security mechanism that is working “more globally”. Indeed, for view generation we 
have successfully designed and verified such a mechanism, in addition to the iteratively 
working ones inspecting sequences of appropriately formed queries, as mentioned 
before. Furthermore, the security mechanism for outsourcing data does not rely on such 
sequences. Nevertheless, these cases are also inspired by the approach to conceptually 
set up some simulated counterpart to the best current view, possibly represented in
some more manageable way. 


6.1.4 Specific Constructions for Alterations 


6.1.4.1 Weakening Including Refusing 
The weakening approach to alterations of a harmful message comprises the special 
case of refusing to provide any explicit information flow as its extreme form. 

In terms of a complete logic a refusal can literally be represented as the tautology
χ ∨ ¬χ, known by a rational attacker right from the beginning without having performed
any interaction with the defender. However, literally replacing χ by the tautology χ ∨ ¬χ
only if the actual validity of χ is harmful, without additionally caring about the
potential harmfulness of its non-validity (equivalently, by completeness, the fictitious
validity of ¬χ), would trigger an implicit information flow by means of the following kind
of meta-reasoning, which exploits the postulated background knowledge about the
security mechanism:


For a (definitely flawed) security mechanism that, while inspecting a valid sentence χ, refuses on
the harmfulness of χ but not on the harmfulness of the fictitious validity of ¬χ, we would have
the following equivalence: refusing occurs if and only if χ is valid and harmful. Thus, observing a
refusal, the attacker could infer the validity of χ.


Notably, this kind of reasoning would just be caused by the careless handling of the 
critical control flow structure in the form of an if-then-else branching. 

Accordingly, to ensure the required inference-proofness, the security mechanism 
has to make the two possible cases of validity and non-validity indistinguishable for 
the attacker, in the simplest way by refusing if and only if at least one of the cases is 
harmful. 
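
The difference between the flawed and the repaired refusal rule can be made concrete in a toy propositional setting. The sketch below assumes two hypothetical atoms, truth-table entailment in place of a theorem prover, and a log that starts empty (no a priori knowledge); `refusal_censor` and its `flawed` switch are illustrative names, not the mechanism as published.

```python
from itertools import product

ATOMS = ("a", "b")  # hypothetical atoms
STATES = [dict(zip(ATOMS, v)) for v in product([False, True], repeat=len(ATOMS))]


def entails(premises, conclusion):
    """Truth-table entailment: every model of the premises satisfies the conclusion."""
    return all(conclusion(d) for d in STATES if all(p(d) for p in premises))


def negate(chi):
    return lambda d: not chi(d)


def harmful(log, sentence, prohibitions):
    """True if the log extended by the sentence entails some prohibition."""
    return any(entails(log + [sentence], psi) for psi in prohibitions)


def refusal_censor(state, queries, prohibitions, flawed=False):
    """Answer closed queries truthfully, refusing where needed.

    flawed=True checks only the actually valid case, as in the flawed
    mechanism discussed in the text; flawed=False checks both cases.
    """
    log, answers = [], []
    for chi in queries:
        truth = chi(state)
        actual, other = (chi, negate(chi)) if truth else (negate(chi), chi)
        refuse = harmful(log, actual, prohibitions)
        if not flawed:
            refuse = refuse or harmful(log, other, prohibitions)
        if refuse:
            answers.append("refused")  # log stays untouched
        else:
            answers.append(truth)
            log.append(actual)
    return answers


def atom_a(d):
    return d["a"]


def atom_b(d):
    return d["b"]


with_secret = {"a": True, "b": True}
without_secret = {"a": False, "b": True}
queries, prohibitions = [atom_b, atom_a], [atom_a]

# Repaired rule: both states produce the same messages.
print(refusal_censor(with_secret, queries, prohibitions))
print(refusal_censor(without_secret, queries, prohibitions))
# Flawed rule: a refusal occurs only when the secret is valid.
print(refusal_censor(with_secret, queries, prohibitions, flawed=True))
print(refusal_censor(without_secret, queries, prohibitions, flawed=True))
```

Under the repaired rule the two states are indistinguishable by their message sequences, whereas under the flawed rule the mere occurrence of a refusal separates them.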

In some situations weakening can also be achieved by using more general
disjunctions. For example, let both ψ_1 and ψ_2 be prohibited pieces of information in isolation
but not the disjunction ψ_1 ∨ ψ_2, i.e., knowing the validity of ψ_1 ∨ ψ_2 in the actual
epistemic state is considered to be harmless, but figuring out which of the two disjuncts
leads to the validity is harmful. Appropriately taking care of options for meta-reasoning
similarly as for pure refusing and, additionally, of the potential entailments among
several such disjunctions and of such disjunctions with other pieces of information, we
might ensure inference-proofness by replacing a valid piece of information ψ_i by the
disjunction ψ_1 ∨ ψ_2.

For all such kinds of weakening, the literal representation of weakened information, 
as conveyed in messages to the attacker, 

- correctly reflects the actual epistemic state of the defender, and
- notifies the attacker about the fact of an actually performed weakening.

In terms of a logic, the correctness property allows us to solve the first three crucial
points for the construction methodology roughly as follows:

1. A simulated current view is directly represented by the set of those sentences that 
so far are known to be valid in the defender’s epistemic state, according to the a 
priori knowledge and the literal contents of messages. 

2. A tentative update is independently performed with both the inspected sentence 
and its negation, respectively, and both versions are checked for violations of the 
security invariant by solving implication/entailment problems of the form “[current 
set of sentences together with tentatively added sentence or its negation, respec- 
tively] entails [sentence designating a prohibition]” by a pertinent theorem prover. 

3. If any of the checks is positive, refusing is straightforwardly identified as the suitable
alteration (or, possibly, a less easily defined but more informative disjunction that is 
stronger than a tautology but still harmless), and then the alteration can be notified 
in the corresponding message, leaving the current representing set of sentences 
untouched (or updating that set by adding the identified disjunction). 


6.1.4.2 Lying 

In contrast to weakening, the lying approach to alterations requires a sharp distinction 
between the literal representation of responses and the attacker’s rational sophisticated 
conclusions about what he is literally observing. 

To start with literal representations, consider for example a directly prohibited 
piece of information ψ in terms of a complete logic. The security mechanism would
always have to pretend literally the non-validity of ψ. Hence, if the attacker is aware of
the security mechanism including the prohibition declaration, he would not need to
query any prohibition ψ: he will always receive the literal response that ¬ψ is valid.

This feature implies that the defender cannot expect the enforcement of a declaration
treating both ψ and ¬ψ as prohibitions. Less obvious is a further consequence: even
if the defender declares several pieces of information ψ_1, ..., ψ_k as individual
prohibitions, nevertheless under lying the disjunction ψ_1 ∨ ... ∨ ψ_k has to be protected
literally as well. For otherwise the attacker could perform the following inconsistency-reasoning:


For a (definitely flawed) security mechanism that only lies literally on the (explicitly declared
harmful) validity of ψ_1, ..., ψ_k but not on the validity of ψ_1 ∨ ... ∨ ψ_k, we could get a literal
representation of responses that contains the following inconsistent set of sentences:
{¬ψ_1, ..., ¬ψ_k, ψ_1 ∨ ... ∨ ψ_k}. Thus, observing such an inconsistency, the attacker could identify
the occurrence of some lying in the literal representation and, furthermore, could be tempted to
infer the validity of any sentence.


Accordingly, to ensure the required inference-proofness, the security mechanism had better
avoid running into such a somewhat “hopeless” situation.

Moreover, regarding rational sophisticated conclusions about a possibly lied
assertion about the non-validity of a prohibition ψ and the literal representation of this
assertion as ¬ψ, the best current view bestcv would contain pairs of epistemic states
that differ (at least) in one state making the sentence ψ valid and the other state making
the negated sentence ¬ψ valid. So, evidently, a concise representation rep(simcv) of
a simulated current view simcv containing only a sentence ¬ψ that is possibly but not
necessarily a literal lie would represent both alternatives (as already said above, in
contrast to the properties of weakening).

The two issues have a conceptually simple and provably effective solution. First,
we replace a prohibition declaration consisting exactly of the single pieces of
information ψ_1, ..., ψ_k by the singleton containing one disjunction ψ_1 ∨ ... ∨ ψ_k. Such a
replacement considerably strengthens the security policy of inference-proofness: for
each epistemic state d under consideration, the existence of an “alternative” state d′ is
required such that all of the ψ_i are simultaneously not valid in d′. Second, we form a
concise representation of the simulated current view by the literally provided responses,
being aware that the same literal representations for weakening and lying, respectively,
essentially differ in their semantics.

Of course, similarly as for weakening but somewhat more subtly, lying is also due
for any harmful sentence χ that could lead the attacker to believe in the validity of the
disjunction of all ψ contained in the prohibition declaration.

In summary, for all kinds of lying, the literal representation of lied information, as
conveyed in messages to the attacker,
- might not correctly reflect the actual epistemic state of the defender and, thus,
  might seriously mislead a naive receiver, and,
- naturally, does not notify the attacking receiver about the fact of an actually
  performed lying and, thus, lays the burden on the receiver of finding out whether or
  not some lying has potentially occurred.

Moreover, in terms of a logic, the first three crucial points for the construction
methodology are solved roughly as follows:

1. A simulated current view is indirectly represented by the set of those sentences that 
so far have been pretended literally to be valid in the defender’s epistemic state, 
according to the a priori knowledge and the content of messages. 

2. A tentative update is checked for violations of the security invariant by solving one 
implication/entailment problem in the form “[current set of sentences together 
with tentatively added sentence] entails [sentence designating the disjunction of 
all prohibitions]” by a pertinent theorem prover. 

3. If that check is positive, a lie on the tentatively added sentence is straightforwardly
identified as the suitable alteration and then sent (without notification, of course) 
in the corresponding message, and that lie is inserted into the current representing 
set of sentences. 
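
The three points for lying can be sketched in the same toy propositional setting. The code assumes two hypothetical atoms, truth-table entailment standing in for the theorem prover, and an empty a priori knowledge; the name `lying_censor` is illustrative. The log holds the sentences that have been literally pretended so far, and the single check is against the disjunction of all prohibitions.

```python
from itertools import product

ATOMS = ("a", "b")  # hypothetical atoms
STATES = [dict(zip(ATOMS, v)) for v in product([False, True], repeat=len(ATOMS))]


def entails(premises, conclusion):
    """Truth-table entailment: every model of the premises satisfies the conclusion."""
    return all(conclusion(d) for d in STATES if all(p(d) for p in premises))


def negate(chi):
    return lambda d: not chi(d)


def lying_censor(state, queries, prohibitions):
    """Return the literally claimed truth value for each closed query."""

    def disjunction(d):
        # The strengthened policy protects the disjunction of all prohibitions.
        return any(psi(d) for psi in prohibitions)

    log, answers = [], []
    for chi in queries:
        truth = chi(state)
        actual = chi if truth else negate(chi)
        # Point 2: one entailment check against the disjunction.
        lie = entails(log + [actual], disjunction)
        # Point 3: on a positive check, pretend the opposite, without notification.
        claimed = (not truth) if lie else truth
        log.append(chi if claimed else negate(chi))
        answers.append(claimed)
    return answers


def atom_a(d):
    return d["a"]


def atom_b(d):
    return d["b"]


def a_or_b(d):
    return d["a"] or d["b"]


prohibitions = [atom_a, atom_b]
queries = [a_or_b, atom_a, atom_b]

print(lying_censor({"a": True, "b": False}, queries, prohibitions))
print(lying_censor({"a": False, "b": False}, queries, prohibitions))
```

Both calls yield the same literal responses, so the state in which a prohibition is valid cannot be told apart from the state in which none is. The pretended log also stays consistent, which avoids the inconsistency-reasoning described above.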


6.1.4.3 Combined Approaches
As we have seen, alterations by refusing on the one hand and by lying on the other 
are precipitated by different causes. Actually refusing is needed if the validity or the 
non-validity of the sentence inspected would be harmful regarding a single prohibition. 
Actually lying is needed if the validity of the inspected sentence would be harmful 
regarding the disjunction over the prohibition declaration. We might wonder whether
and how we can do better by not always having to check the impact of both the validity and
the non-validity of the sentence inspected and by never considering the strengthened
security policy to protect the disjunction. This goal can be achieved by a suitable
combination of refusing and lying: 

- if the validity of the sentence inspected is not harmful regarding any prohibition,
  then return a message without any alteration;
- if the validity of the sentence inspected is harmful regarding some prohibition
  and also the non-validity of the sentence inspected is harmful regarding some
  prohibition, then return a message suitably indicating a refusal;
- if only the non-validity of the sentence inspected is harmless, then return a message
  suitably altered by the lie that literally pretends the non-validity.


Moreover, we can still represent a simulated current view by the set of those sentences 
that so far have been pretended literally to be valid. 
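
The three cases of the combined approach translate into a short decision procedure. The sketch assumes that “the sentence inspected” is the one actually valid in the defender’s state (the query or its negation), uses two hypothetical atoms with truth-table entailment instead of a theorem prover, and starts from an empty log; `combined_censor` is an illustrative name.

```python
from itertools import product

ATOMS = ("a", "b")  # hypothetical atoms
STATES = [dict(zip(ATOMS, v)) for v in product([False, True], repeat=len(ATOMS))]


def entails(premises, conclusion):
    """Truth-table entailment: every model of the premises satisfies the conclusion."""
    return all(conclusion(d) for d in STATES if all(p(d) for p in premises))


def negate(chi):
    return lambda d: not chi(d)


def harmful(log, sentence, prohibitions):
    """True if the log extended by the sentence entails some single prohibition."""
    return any(entails(log + [sentence], psi) for psi in prohibitions)


def combined_censor(state, queries, prohibitions):
    log, answers = [], []
    for chi in queries:
        truth = chi(state)
        actual, other = (chi, negate(chi)) if truth else (negate(chi), chi)
        if not harmful(log, actual, prohibitions):
            answers.append(truth)          # case 1: no alteration
            log.append(actual)
        elif harmful(log, other, prohibitions):
            answers.append("refused")      # case 2: both cases harmful
        else:
            answers.append(not truth)      # case 3: lie, pretend the opposite
            log.append(other)
    return answers


def atom_a(d):
    return d["a"]


def atom_b(d):
    return d["b"]


def a_or_b(d):
    return d["a"] or d["b"]


prohibitions = [atom_a, atom_b]  # individual prohibitions, disjunction unprotected

print(combined_censor({"a": True, "b": False}, [a_or_b, atom_a, atom_b], prohibitions))
print(combined_censor({"a": True, "b": False}, [atom_a], prohibitions))
print(combined_censor({"a": False, "b": False}, [atom_a], prohibitions))
```

In the first call the disjunction is answered truthfully, after which both atomic queries must be refused. In the last two calls a lie in one state and a truthful answer in the other give the same literal response.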


6.1.4.4 Weakening versus Lying 

At first glance, alterations by lying appear to be rather problematic, both from an ethical 
point of view and regarding the desired functionality. However, while lying is ethically 
banned in general, we all know of widely accepted exceptions, e.g., a white lie in a
most critical situation, a small insincerity to avoid some larger offense, or an untruthful 
answer to an illegal request. 

Moreover, in many cases literally lying might also functionally disturb the agreed 
cooperation between the communication partners, insofar as the receiver is behaving 
naively and “believing the lies” without further own reasoning. Even then, however, if 
an actually occurring alteration by literally lying does not affect an implicit or explicit 
availability declaration, we might argue that the respective interaction is at least beyond 
the agreement or even misusing it. 

Finally, handling an interaction that is intended to change the defender’s epistemic 
state by means of a transaction, we might face a confidentiality—integrity conflict for 
which an application of literal lying appears to be most natural, at least if the ethical 
and functional concerns are appropriately dealt with. 

Above, we already distinguished between “lying” and “literal lying”, and a deeper 
inspection of the issues indicates that such a distinction is crucial for a more informa- 
tion-theoretical discussion. To start with, for any kind of alteration, a sophisticated 
attacker can always distinguish between the conceptual notion of the best current view 
and the pragmatically introduced notion of the simulated current view and its concise
representation. 
The best current view bestcv can be seen as a kind of inverse image, as derived
from the observed message history mes_1, ..., mes_k and denoted by bestcv =
con_mess⁻¹(mes_1, ..., mes_k). Here, we think of the message history as being composed
of k′ many request messages mes_{a,1}, ..., mes_{a,k′} received from the attacker, and k″ many
reaction messages mes_{d,1}, ..., mes_{d,k″} returned to the attacker, with k = k′ + k″, where
the latter were produced for the actual epistemic state by the defender’s message-
generating function possibly applying alterations, called con_mess.

Exploiting the postulated background knowledge, the attacker can thus determine 
bestcv in mathematical terms, and insofar as con_mess is an effectively computable 
function, i.e., its graph 


{ (es, hist) | es is epistemic state,
               hist composed of received hist_a and returned hist_d is message history,
               con_mess(hist_a, es) = hist_d }


is recursively enumerable, the graph of the inverse function con_mess⁻¹ is recursively
enumerable as well and, thus, for each message history hist the inverse image
con_mess⁻¹(hist) is also a recursively enumerable set. Notably, under this perspective,
there is no conceptual difference between refusing and lying, or between any other 
approach to alterations. 
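
For a finite state space the inverse image can simply be enumerated. The sketch below assumes that con_mess is available as an ordinary Python function taking the received requests and an epistemic state; the mechanism `denying_con_mess`, which literally denies one atom and answers all others truthfully, is a made-up toy and not one of the constructions above.

```python
from itertools import product

ATOMS = ("a", "b")  # hypothetical atoms
STATES = [dict(zip(ATOMS, v)) for v in product([False, True], repeat=len(ATOMS))]


def best_current_view(con_mess, hist_a, hist_d, admissible=STATES):
    """Inverse image of con_mess: all admissible states that would have
    produced the returned messages hist_d for the received requests hist_a."""
    return [es for es in admissible if con_mess(hist_a, es) == hist_d]


def denying_con_mess(hist_a, es):
    """Toy message generator: requests are atom names; atom "a" is always
    literally denied, every other atom is answered truthfully."""
    return [False if atom == "a" else es[atom] for atom in hist_a]


view = best_current_view(denying_con_mess, ["a", "b"], [False, True])
print(view)
```

The resulting view contains two states, one with the denied atom valid and one with it not valid. The procedure is the same whichever kind of alteration con_mess applies, which reflects the remark that refusing and lying do not differ conceptually under this perspective.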

The concise representation rep(simcv) of the simulated current view simcv is a techni- 
cal means employed by the defender’s security mechanism to effectively—and hopefully 
also efficiently—enforce the pertinent security invariant. However, exploiting the pos- 
tulated background knowledge, the attacker can determine rep(simcv) as well. In fact, 
as outlined above, for weakening and lying that representation is just formed by the 
literal messages sent to the attacker. 

Hence, in information-theoretical terms, the main difference between weaken- 
ing (with the special case of refusing) on the one hand and any method that at least 
sometimes literally lies on the other hand can be described as follows: 

— under weakening the representation of the simulated current view also represents 
the best current view in a straightforward way, 

— whereas under lying the burden of the attacker to determine the best current view 
might be much harder. 


In other words, for a sophisticated and rationally reasoning attacker, (i) there are no 
“real lies” but only literal ones, and (ii) lying causes an essential difference between 
what he can literally observe and what he can conceptually conclude, and in general 
imposes a high computational complexity on algorithmically determining the best 
current view as the least uncertainty. 

A more detailed comparison of refusing and lying for a sequence of queries considers the 
longest prefix for which the responses have been correct (for short, called the “longest 
honeymoon”). It turns out that refusing and lying are in general incomparable regarding 
this notion. 
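
As a small illustration of this notion, the following sketch computes the length of the longest correct prefix for a given response sequence. The function name `longest_honeymoon` and the two response sequences are hypothetical and chosen only for illustration; a refusal is counted here as a response that is not correct.

```python
def longest_honeymoon(responses, correct_answers):
    """Length of the longest prefix of responses that agree with the
    correct answers; a refusal never equals a correct answer."""
    length = 0
    for given, correct in zip(responses, correct_answers):
        if given != correct:
            break
        length += 1
    return length


# Hypothetical response sequences for the same four closed queries.
correct = [True, False, True, True]
under_refusing = [True, "refused", True, True]
under_lying = [True, False, False, True]

honeymoon_refusing = longest_honeymoon(under_refusing, correct)
honeymoon_lying = longest_honeymoon(under_lying, correct)
```

In this made-up example the honeymoon under lying is longer; for other query sequences the relation can be reversed, which is what incomparability means here.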

Somewhat surprisingly, however, if we also require refusing to protect the disjunction over 
the prohibition declaration, the information contents supplied by the two 
approaches are exactly the same, i.e., the conceptual best current views are always 
equal or, in other words, two epistemic states are indistinguishable for the attacker in 
case of refusing if and only if they are so in case of lying. Moreover, we can show that 
an actual refusal occurs if and only if a potential lie occurs. This result shows again 
that the information-theoretical difference between refusing and lying consists in the 
computational burden on the attacker to find out the degree of reliability of a message 
received: for refusing, the attacker is explicitly notified by the defender; for lying, the 
attacker has to find this out himself by means of rational reasoning (or just by simulating 
a defender that applies refusing). 


6.1.5 Managing Computational Complexity 


The overall computational complexity of inference-proof interactions is basically deter- 
mined by the normal functionality of the information system under protection on the 
one hand and the overhead caused by the security mechanism on the other. Regarding 
the impact of the former, as far as applicable, the pertinent logic underlying the query 
evaluation appears to be most crucial. Regarding the impact of the latter, two factors 
are most important: the consideration of the interaction history (including the assumed 
a priori knowledge) by maintaining a concise representation of the simulated current 
view, and the number and kinds of checks that determine whether a tentative update 
satisfies or violates the security invariant. As far as applicable, the pertinent logic once 
again determines the costs. 

In general, we can expect a rather high or even practically infeasible level of 
complexity, and in some cases the problem even lies beyond effective computability. 
Moreover, besides treating the constraint-solving problem of achieving inference-proofness 
for the sake of confidentiality, we always aim at still providing good availability and 
thus face the optimization problem of actually performing an alteration only if strictly 
needed. Each of these problems alone is known to be computationally hard in general, 
and so is their combination. 

Whether or not a high computational complexity can be afforded might depend 
on the additional timing constraints of the desired communication acts. In particu- 
lar, if the (attacking) communication partner expects to be served by the (defending) 
information owner online in real time, only a minor delay would be acceptable. By 
contrast, insofar as the interaction of view generation is initialized by the (defending) 
information owner, all the computations can be done offline and thus might last as 
long as several hours. Accordingly, as a general heuristic we favor shifting suitable 
parts of the overall computational burden to offline precomputations, whose duration 
either does not matter to the communication partner at all or is bounded by the time 
normally spent by an interactively communicating partner on some other activity. 

Given that at the core of any security mechanism we have to check the tentative 
updates of the representation of the simulated current view in terms of an underlying 
logic for complying with a security invariant, we can attempt to decrease the number and 
to lessen the complexity of such checks by restricting to special cases of the sentences 
used for formally expressing a priori knowledge, queries, and prohibitions. The best 
case would be that it suffices to relate (the validity of) a sentence inspected to the 
prohibitions in a straightforward way without the need to consider a simulation and 
thus the interaction history at all. Intuitively, this case could arise if both queries and 
prohibitions refer to elementary and mutually independent pieces of information. 
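
The general check of tentative updates can be sketched as follows, under strong simplifying assumptions that are not taken from the original text: the underlying logic is propositional, sentences are modeled as Python predicates over truth assignments, entailment is decided by brute-force enumeration, and the security invariant is taken to be "the represented view entails no prohibition". The refusing-style function `answer` and all other names are hypothetical. It inspects both the queried sentence and its negation so that the decision to refuse does not depend on the actual state.

```python
from itertools import product

ATOMS = ("p", "q", "s")  # hypothetical propositional atoms


def models(sentences):
    """All truth assignments that satisfy every sentence (brute force)."""
    found = []
    for values in product((False, True), repeat=len(ATOMS)):
        w = dict(zip(ATOMS, values))
        if all(f(w) for f in sentences):
            found.append(w)
    return found


def entails(sentences, goal):
    """Logical entailment: goal holds in every model of the sentences."""
    return all(goal(w) for w in models(sentences))


def violates(view, prohibitions):
    """Security invariant is violated if the view entails a prohibition."""
    return any(entails(view, psi) for psi in prohibitions)


def answer(view, query, actual_state, prohibitions):
    """Check both tentative updates; refuse if either would be harmful."""
    negation = lambda w: not query(w)
    if (violates(view + [query], prohibitions)
            or violates(view + [negation], prohibitions)):
        return "refused", view
    if query(actual_state):
        return True, view + [query]
    return False, view + [negation]


# A priori knowledge "p implies s"; the prohibition is "s".
view0 = [lambda w: (not w["p"]) or w["s"]]
prohibitions = [lambda w: w["s"]]
state = {"p": True, "q": True, "s": True}

r1, view1 = answer(view0, lambda w: w["q"], state, prohibitions)
r2, view2 = answer(view1, lambda w: w["p"], state, prohibitions)
```

Here the query about `q` is harmless and answered correctly, whereas the query about `p` is refused because a positive answer together with the a priori knowledge would entail the prohibition. Every call enumerates all truth assignments, so the cost grows exponentially with the number of atoms; this is the kind of overhead that the restrictions discussed above are meant to avoid.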

More generally, the following guidelines for identifying computationally efficient 
cases have been successful: diminish the potential mutual dependence of the consid- 
ered pieces of information about the defender’s epistemic state and, thus, the redun- 
dancy contained in that state; and syntactically restrict the sentences expressing such 
pieces of information such that the pertinent logical entailment problems are easily 
solvable. In fact, as a first example and referring to the best case regarding closed-query 
evaluation, checking tentative updates can be done without considering the interaction 
history, and the entailment problems can be reduced to simple text comparisons under 
the following conditions: epistemic states are represented by relational instances of a 
relational schema with functional dependencies in Object Normal Form, i.e., they are in 
Boyce-Codd Normal Form and satisfy the Unique Key Property, and the epistemic states 
contain only atomic sentences, i.e., logical representations of single tuples. Further 
examples selectively relax these requirements so that checking tentative updates 
can be implemented efficiently by means of SQL, even for restricted cases of open-query 
evaluation. 
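
The flavor of the best case can be conveyed by a heavily simplified sketch in which both the stored facts and the prohibitions are single tuples and a closed query asks whether one tuple is present. The relation, the tuples, and the function `stateless_censor` are hypothetical; the normal-form conditions mentioned above are not modeled but merely assumed to guarantee that distinct tuples carry mutually independent information.

```python
def stateless_censor(instance, prohibited_tuples, queried_tuple):
    """Decide a closed single-tuple query without any interaction history.

    The check is a plain comparison of the queried tuple with the declared
    prohibitions; a match is refused regardless of the actual instance.
    """
    if queried_tuple in prohibited_tuples:
        return "refused"
    return queried_tuple in instance


# Hypothetical relational instance over the attributes (name, department).
instance = frozenset({("alice", "sales"), ("bob", "research")})
prohibited = frozenset({("bob", "research"), ("carol", "research")})

r_open = stateless_censor(instance, prohibited, ("alice", "sales"))
r_absent = stateless_censor(instance, prohibited, ("dave", "sales"))
r_secret_true = stateless_censor(instance, prohibited, ("bob", "research"))
r_secret_false = stateless_censor(instance, prohibited, ("carol", "research"))
```

Because the reaction to a prohibited tuple is the same whether or not the tuple is actually stored, no simulated current view has to be maintained in this restricted setting.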

We can also employ a wide range of approximation heuristics to decrease the 
computational complexity, first of all by relaxing the availability requirements in order 
to simplify the resulting optimization problem of minimizing alterations. 


6.1.6 Conclusions: Naive Illusion or Promising Hope? 


Our main motivation has been to design and mathematically verify technical solu- 
tions to the social issue of preserving privacy, or of any other justified confidentiality 
concerns. Clearly, in general these goals require the consideration of a large range of 
psychological, social, institutional, legal, information-theoretical, and mathematical 
features. In public discussions about privacy preservation in the Information Age, some 
voices even claim that achieving privacy has become illusory. Without discussing the 
arguments in detail, we just state that we do not share this view, and neither, we believe, 
do the legal bodies issuing the pertinent legislation. 

The framework underlying our achievements focuses on a very narrow aspect 
of enforcing confidentiality: how to support a human information owner in hiding 
the validity of dedicated pieces of information, referred to as declared prohibitions, 
contained in the internal epistemic state of an information system agent under the 
owner’s control, while that agent interacts with the client agent run by a semi-honest 
and rationally reasoning communication partner. Thus, the overall target of protection 
is only the internal state of a technical device whose well-defined interface to the outside 
world is supposed to be configurable and mastered at the discretion of the controlling 
human individual. 

In the extreme case, that interface can just be totally disabled such that—under 
reasonable assumptions—no human can observe the internal state of the information 
system agent at all, thus preventing the availability of any information. So, conceptually 
starting with a disabled interface, the real problem is to gradually allow a flow of 
information from the internals of an otherwise completely shielded computing agent, 
while still guaranteeing the wanted kind of confidentiality of the declared prohibitions. 
The other way round, conceptually starting with a totally open interface, the real 
problem is to minimally confine such a flow of information, until the wanted kind of 
confidentiality according to the declared prohibitions is achieved, thus still preserving 
a maximal availability. 

We want to emphasize that we are not aiming at anything more. Neither do we 
want to confine the information owner in chatting about what he has in his human 
mind, nor do we want to hinder the communication partner in observing real-world 
facts, nor to prevent him from exploiting any further information source. We care only 
about the conceptual information flow from the internal state of a technical device to 
an interacting computing agent based on protocol-complying exchanges of messages. 
Clearly, the occurrences of such flows might depend on additional circumstances, 
as captured by our assumptions and postulates about the communication partner 
regarding his a priori knowledge and his background knowledge, respectively. 

So, our achievements are as promising as these assumptions and postulates are 
realistic, provided that all the other features we have left aside can also be suitably 
dealt with. 


6.1.7 Selected Bibliographic Notes 


The study of inference-proof interactions of a logic-oriented information system started 
with two contributions about the interaction of closed-query evaluation with response 
preparation. Sicherman, Jonge, and Riet [611] suggested the refusing approach early, and 
Bonatti, Kraus, and Subrahmanian [96] later introduced the lying approach. Following 
a first attempt to compare the two approaches, Biskup and Bonatti [78] set up a unifying 
framework for further developments. 

Biskup and Bonatti [75, 76] introduced the combination of refusing and lying, for 
closed queries. Furthermore, Biskup and Bonatti [77] treated open-query evaluation 
with response-set preparation for a decidable relational submodel; considerably 
later, Biskup, Bring, and Bulinski [79] reported on a partial prototype implementation 
with some optimizations. Among other attempts to restrict the expressibility of 
relational a priori knowledge, queries and prohibitions to enable inference-proofness 
in the spirit of access control by means of SQL only, Biskup, Embley, and Lochner 
[71] identified the impact of relational database schema normalization. Biskup and 
Weibert [86] extended all three approaches to alterations of responses to closed queries 
to an underlying incomplete propositional information system, which offers a third 
option (don’t know) in addition to yes and no. Biskup, Gogolin, Seiler, and Weibert 
[81] added knowledge update transactions as a further interaction under lying, later 
also treated for refusing. Moreover, Biskup and Tadros [84] investigated the impact of 
non-monotonic reasoning for the interaction of belief revision. Biskup and Wiese [87] 
studied a concept of view generation as yet another interaction under lying, essentially 
as a combination of a restricted first-order logic satisfiability problem and a 
minimization problem. Later, Biskup, Dahn, Diekmann, Menzel, Schalge, and Wiese [80] 
presented a prototype implementation exploring several heuristic optimizations and 
approximations. Biskup and Preuß [83] invented another method for view generation, 
later also extended to view updating, essentially based on weakening by means of 
disjunctions of prohibitions. Biskup, Tadros, and Zarouali [85] explored how to handle 
interactions expressed as procedural program executions in an inference-proof way, in 
particular exploiting methods of language-based security aiming at the security policy 
of non-interference. Biskup and Preuß [82] analyzed the fragmentation approach to 
secure data outsourcing. Finally, Biskup [73] studied the interaction of closed-query 
evaluation and view generation in the framework of abstract data sources, already 
presenting the framework reused in Section 6.1.2. 

Some of these developments are discussed by Biskup [74]. Our narrower topic is 
embedded in some more general streams of research and related to many other specific 
topics, as outlined in Biskup [72]. Guarnieri, Marinovic, and Basin [243] and Halpern 
and O’Neill [255] are examples of taking a wider perspective that includes probabilistic 
considerations. 